Why everything you learned about Oracle optimization is probably wrong—and what the math actually says
The Deterministic Illusion
In the landscape of relational database management systems, the Oracle Cost-Based Optimizer (CBO) stands as a monument to software engineering complexity. It is the brain of the database, tasked with a singular, Herculean objective: to navigate a search space of potential execution plans that, for complex queries, can exceed the number of atoms in the observable universe, and to select the single most efficient path—all within a fraction of a second.
For decades, the CBO has been viewed by many database administrators and developers as a “black box”—an opaque mechanism that ingests SQL statements and statistics, and emits execution plans based on logic that often appears arcane, capricious, or counter-intuitive. When a query that ran in milliseconds yesterday suddenly takes hours today, the finger is often pointed at the CBO’s “instability,” leading to a culture of defensive coding, hint injection, and parameter tweaking based on folklore rather than physics.
Here’s the uncomfortable truth: The CBO is not merely a “cost” engine but a sophisticated time-prediction machine constrained by specific mathematical assumptions. And almost everything the average DBA believes about it is wrong.
This deep dive will dismantle decades of myths. We’ll explore why “Cost” is actually a unit of time, why the Index Clustering Factor routinely lies about data organization, why 5500 rows is a magic number for statistics, why the “First Rows” optimization mode is a heuristic trap, and why the optimizer sometimes deliberately chooses higher-cost plans. Along the way, we’ll examine war stories where these misunderstandings brought production systems to their knees.
Myth #1: “Cost is an Abstract, Unitless Number”
The Popular Belief: When a DBA explains an execution plan to a developer, they often describe Cost as an arbitrary, unitless number—a “score” useful only for comparing Plan A against Plan B for the same query. Even Tom Kyte once stated on AskTom that cost is just an abstract unit for comparing plans within a single query.
The Reality: Jonathan Lewis explicitly debunked this: “Cost is Time. The cost of a query represents the optimizer’s best estimate of how long it will take that query to run.”
The Mathematics of Cost
In the nascent stages of the CBO (Oracle 8i and prior), the cost calculation was indeed relatively abstract—primarily a function of estimated I/O operations. With the advent of System Statistics in Oracle 9i, the CBO underwent a paradigm shift that many practitioners still do not fully internalize.
The fundamental equation governing cost calculation, derived from the CPU Costing Model, is:
Cost = ((#SRds × sreadtim) + (#MRds × mreadtim) + (#CPUCycles / cpuspeed)) / sreadtim
Where:
- #SRds: Estimated number of Single Block Reads (random I/O)
- sreadtim: Average time to read a single block (milliseconds)
- #MRds: Estimated number of Multi-Block Reads (sequential I/O)
- mreadtim: Average time to read a multi-block chunk
- #CPUCycles: Estimated CPU instructions required
- cpuspeed: CPU speed in millions of operations per second
The division by sreadtim in the denominator is critical and often overlooked. A Cost of 100 does not mean “100 I/Os”—it strictly means “The predicted execution time is equivalent to the time it takes to perform 100 single-block reads on this system.”
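To make the normalization concrete, here is a worked example with purely illustrative numbers: a plan needing 80 single-block reads (sreadtim = 10ms), 20 multi-block reads (mreadtim = 30ms), and a CPU component worth 100ms of processing time:
Cost = (80 × 10 + 20 × 30 + 100) / 10 = (800 + 600 + 100) / 10 = 150
The predicted runtime is 1,500ms, reported as a Cost of 150: the time it takes this system to perform 150 single-block reads.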
The Counter-Intuitive Flash Storage Effect
This normalization has profound implications for modern all-flash environments:
Scenario A (Spinning Disk): If sreadtim is high (e.g., 10ms), the I/O component dominates. CPU time (usually microseconds) divides by a large denominator, contributing minimally to total Cost.
Scenario B (NVMe Flash): If sreadtim is extremely low (e.g., 0.5ms), the denominator shrinks. The relative contribution of CPU processing to Cost increases dramatically.
This explains why, on modern high-performance hardware, DBAs often see execution plans where CPU-intensive operations appear to have astronomically high costs, even if disk I/O is minimal. The CBO is correctly signaling that, relative to lightning-fast storage, CPU processing is now the bottleneck.
Validation: Extract the numerical COST from any plan and multiply it by your system’s SREADTIM (from sys.aux_stats$). The result will match the plan’s predicted execution time (the TIME column) within rounding error. For a plan with COST = 500 on a system where SREADTIM = 10ms:
Predicted Time ≈ 500 × 10ms = 5000ms (5 seconds)
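A minimal validation sketch, assuming access to sys.aux_stats$ and the standard PLAN_TABLE (the statement being explained is whatever is convenient on your system):
-- The system's measured I/O and CPU figures:
SELECT pname, pval1
FROM   sys.aux_stats$
WHERE  sname = 'SYSSTATS_MAIN'
AND    pname IN ('SREADTIM', 'MREADTIM', 'CPUSPEED');

EXPLAIN PLAN FOR
SELECT * FROM employees;   -- any convenient statement

-- TIME is reported in seconds; id = 0 is the top-level operation:
SELECT cost, time FROM plan_table WHERE id = 0;
-- Expect: time ≈ cost × (SREADTIM / 1000), within rounding.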
If a DBA leaves system statistics at their default values (which often reflect hardware from the late 1990s), the CBO’s “time” map will be hopelessly out of sync with reality.
Myth #2: “The CBO Always Chooses the Lowest-Cost Plan”
The Popular Belief: The optimizer’s sole purpose is to find and execute the plan with the lowest calculated cost.
The Reality: Jonathan Lewis documented cases where the optimizer intentionally discards lower-cost plans in favor of higher-cost alternatives.
The Unpeekable Bind Variable Defense
This happens when dealing with “unpeekable” bind values—ones the optimizer cannot inspect at parse time, such as sys_context() calls, scalar subqueries returning constants, or predicates injected by fine-grained access control (VPD).
The optimizer follows an internal defensive rule: when an index predicate starts with an unknown range, the CBO reasons that the worst case might require a full index scan. Rather than risk catastrophic performance, it chooses a “safer” plan with a known starting predicate—even when that plan has higher calculated cost.
This isn’t a bug; it’s defensive engineering.
Consider: Would you rather have a query that’s consistently 100ms, or one that’s usually 10ms but occasionally explodes to 10 minutes? The CBO’s conservative choice prioritizes predictability over theoretical optimality.
Myth #3: “The Clustering Factor Measures Data Organization”
The Popular Belief: A “good” Clustering Factor (CF) near the number of table blocks means data is well-organized. A “bad” CF near the number of rows means data is scattered.
The Reality: The Clustering Factor algorithm fails catastrophically in high-concurrency environments, making excellent data organization appear terrible to the optimizer.
The Round-Robin Failure Mode
The CF algorithm simulates reading the entire table via an index in sorted order, assuming a buffer cache size of one block. It counts “block switches” as it traverses index entries:
- Walk the index leaf blocks in sorted order
- For each entry, extract the ROWID and determine the table block address
- If current block ≠ previous block: increment CF
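The algorithm is easy to emulate. A sketch against a hypothetical table T with an index on ID (it ignores file and object numbers for brevity):
-- Count block switches while walking ROWIDs in index-key order:
SELECT COUNT(*) AS clustering_factor
FROM  (SELECT dbms_rowid.rowid_block_number(rowid) AS blk,
              LAG(dbms_rowid.rowid_block_number(rowid))
                OVER (ORDER BY id) AS prev_blk
       FROM   t
       WHERE  id IS NOT NULL)      -- only rows with index entries
WHERE  blk != prev_blk OR prev_blk IS NULL;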
The ASSM Disaster:
In environments using Automatic Segment Space Management (ASSM) or multiple Free Lists, Oracle separates concurrent INSERT sessions into different physical blocks to prevent buffer contention. This creates “round-robin” distribution:
Imagine three concurrent sessions inserting into a table with an indexed ID column:
- Block A: ID 1, 4, 7, 10… (Session 1)
- Block B: ID 2, 5, 8, 11… (Session 2)
- Block C: ID 3, 6, 9, 12… (Session 3)
Human perspective: The data is highly clustered—IDs 1-12 reside in just three blocks. A range query needs only these three blocks.
CF Algorithm’s view: Every single index entry triggers a “block switch.” CF equals the total number of rows—the worst possible score.
CBO’s conclusion: “To read 1000 rows via this index, I’ll need 1000 physical I/Os because data is randomly scattered.”
Reality: After reading Block A for ID 1, it remains in the buffer cache. When the index asks for ID 4 (also in Block A), it’s a cache hit, not physical I/O.
The standard Clustering Factor algorithm fails to account for the buffer cache. In ASSM/RAC environments, where data is “locally fragmented but globally dense,” the CBO pessimistically reverts to Full Table Scans when Index Range Scans would be orders of magnitude faster.
The Fix
Oracle 11g/12c introduced TABLE_CACHED_BLOCKS preference in DBMS_STATS, which simulates a small buffer cache during statistics gathering. This produces an Adjusted Clustering Factor that reflects reality in modern architectures.
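A sketch of setting the preference; the schema, table, and the value 16 are illustrative ('AUTO' is also valid):
BEGIN
  dbms_stats.set_table_prefs(
    ownname => 'APP',
    tabname => 'SALES',
    pname   => 'TABLE_CACHED_BLOCKS',
    pvalue  => '16');   -- simulate a 16-block cache during CF calculation
  dbms_stats.gather_table_stats(ownname => 'APP', tabname => 'SALES');
END;
/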
Myth #4: “Fresh Statistics Are Always Better”
The Popular Belief: Gather statistics as often as possible—nightly, or even more frequently.
The Reality: Graham Wood (Oracle performance expert) explains: “You don’t necessarily need up-to-date statistics. You need statistics that are representative of your data.”
The Consistency Paradox
If you gather statistics every night, your system may behave differently every day as the optimizer makes different choices based on small data shifts. As Anjo Kolk asked: “Do you want the optimizer to give you the best performance, or consistent performance?”
Maria Colgan (Oracle Optimizer Product Manager) recommends keeping 11g statistics constant during 12c upgrades until the system stabilizes. Never change statistics and optimizer version simultaneously.
Better approach:
- Let automatic statistics gathering handle most tables
- Lock statistics for tables with very specific performance requirements
- Only gather after significant data distribution changes
- Test statistics changes in non-production first
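Locking and deliberately refreshing statistics looks like this, as a sketch (schema and table are hypothetical):
EXEC dbms_stats.lock_table_stats('APP', 'ORDERS')

-- When a deliberate, tested refresh is due:
EXEC dbms_stats.unlock_table_stats('APP', 'ORDERS')
EXEC dbms_stats.gather_table_stats('APP', 'ORDERS')
EXEC dbms_stats.lock_table_stats('APP', 'ORDERS')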
Myth #5: “5500 Rows is Enough for Sampling (And That’s Bad)”
The Popular Belief: Oracle’s default sample size of ~5500 rows is an arbitrary shortcut that produces inaccurate statistics. Real DBAs use 30% or higher.
The Reality: The number 5500 is rooted in statistical sampling theory, not arbitrary convenience.
The Mathematical Foundation
Oracle internal research established that a random sample of roughly 5500 rows is sufficient to determine data distribution with 90% confidence, provided the data doesn’t exhibit pathological skew. Standard sampling theory backs this up: for identifying the shape of a distribution, population size is largely irrelevant, whether the table holds 1 million or 1 billion rows.
Escalating sample size to 10% or 30% yields diminishing returns in accuracy while incurring linear penalties in performance.
The 11g/12c Revolution
The real secret: Modern AUTO_SAMPLE_SIZE reads 100% of the rows—but uses probabilistic counting algorithms (similar to HyperLogLog) that scan all rows without sorting them. The algorithm hashes values into a bounded memory “sketch” to estimate NDV with near-perfect accuracy.
The irony: Manually setting ESTIMATE_PERCENT => 30 is a retrograde step. It forces the old, slower sorting code path. Leaving it at AUTO_SAMPLE_SIZE engages the modern algorithm that provides deterministic accuracy for basic stats while being faster than manual sampling.
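Side by side, with a hypothetical APP.SALES table:
-- Engages the modern one-pass, hash-based NDV algorithm (the default):
EXEC dbms_stats.gather_table_stats('APP', 'SALES', estimate_percent => dbms_stats.auto_sample_size)

-- Forces the legacy sort-based sampling code path; avoid:
EXEC dbms_stats.gather_table_stats('APP', 'SALES', estimate_percent => 30)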
Myth #6: “IN vs EXISTS—One is Always Faster”
The Popular Belief: EXISTS is always faster than IN, or vice versa.
The Reality: Post-Oracle 9i, both are essentially identical. The optimizer transforms IN subqueries and EXISTS subqueries, then costs both to produce the best plan.
This was genuinely true in pre-9i versions, which is why the myth persists—people learned it decades ago and continue teaching it without testing whether it still applies. Modern Oracle developers can stop worrying about this distinction entirely.
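It is easy to verify on any post-9i system. With hypothetical DEPT and EMP tables, both forms typically unnest to the same semi-join:
SELECT d.dname FROM dept d
WHERE  d.deptno IN (SELECT e.deptno FROM emp e);

SELECT d.dname FROM dept d
WHERE  EXISTS (SELECT 1 FROM emp e WHERE e.deptno = d.deptno);

-- DBMS_XPLAN.DISPLAY_CURSOR shows the same plan (e.g., HASH JOIN SEMI) for both.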
Myth #7: “Hints Are Suggestions the Optimizer Can Ignore”
The Popular Belief: Oracle hints are “soft recommendations” the optimizer can freely disregard.
The Reality: Tanel Poder’s research confirms: “Oracle SQL hints are directives, exact orders that the optimizer MUST obey—but only if they are valid in the context of the current plan shape evaluated.”
The Seven Levels of Hint Invalidity
When a hint appears to be “ignored,” it’s actually invalid for specific reasons:
- Syntax is incorrect
- Referenced object or index doesn’t exist
- Hint conflicts with another hint
- Query transformation made the hint inapplicable
- Table alias doesn’t match
- The hint is semantically impossible
- Internal optimizer constraints
Oracle 19c added a “Hint Report” feature that explicitly shows why hints weren’t used.
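Viewing the report for the last statement executed in the session, as a minimal sketch:
SELECT *
FROM   TABLE(dbms_xplan.display_cursor(format => 'TYPICAL +HINT_REPORT'));
-- The Hint Report section at the foot of the plan lists every hint and
-- marks those that were unused or in error.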
The distinction matters: If you believe hints are suggestions, you’ll add more hints when they don’t work. If you understand they’re orders that must be valid, you’ll debug why the hint is invalid—which leads to actual solutions.
Myth #8: “FIRST_ROWS Mode Makes Queries Faster”
The Popular Belief: FIRST_ROWS is a “faster” version of the optimizer, optimized for OLTP.
The Reality: FIRST_ROWS (without the _n) is a deprecated, heuristic-based mode that acts as a “zombie” version of the Rule-Based Optimizer.
The Lobotomy Effect
FIRST_ROWS injects hard-coded rules that override cost mathematics:
- Index Bias: If an index exists, use it—even if a Full Table Scan is mathematically faster
- Nested Loops Bias: Always prefer Nested Loops over Hash Joins regardless of data volume
- Statistics Ignorance: It largely ignores the probability that the index might retrieve 90% of the table
Using FIRST_ROWS effectively lobotomizes the CBO, forcing it to ignore sophisticated cost models in favor of rigid rules from the 1990s. It ignores high Clustering Factors, low selectivity, and everything else that makes modern optimization work.
The Correct Alternative: FIRST_ROWS_n
FIRST_ROWS_n (where n = 1, 10, 100, 1000) is fully Cost-Based. It alters the cost mathematics by prorating costs based on the assumption that you’ll stop fetching after n rows:
Cost_n ≈ Cost_total × (n / Total Rows)
But beware: this math relies heavily on accurate “Total Rows” estimates. Cardinality errors cascade through this calculation too.
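With illustrative numbers: a plan costed at 10,000 for an estimated 1,000,000 rows prorates to roughly 10,000 × (10 / 1,000,000) = 0.1 under FIRST_ROWS(10). Enabling the mode per session or per statement (the ORDERS table is hypothetical):
ALTER SESSION SET optimizer_mode = FIRST_ROWS_10;

SELECT /*+ FIRST_ROWS(10) */ *
FROM   orders
ORDER  BY order_date DESC;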
Myth #9: “Index Rebuilds Improve Performance”
The Popular Belief: Indexes become “unbalanced” and require regular rebuilding—many shops run rebuild scripts weekly.
The Reality: Jonathan Lewis and Richard Foote have written extensively debunking this. B-tree indexes self-balance, and regular rebuilds are almost never necessary.
The Only Valid Reasons to Rebuild
- Measurable performance gain worth the operational cost (rare)
- Administrative benefit like reclaiming space after massive deletes
The Harm from Unnecessary Rebuilds
- ORA-01410 errors from invalid ROWIDs during active use
- Wasted system resources
- Potential performance degradation if timed poorly
- Lock contention blocking operations
The origin of the myth: Oracle training materials from the 1990s recommended index rebuilds. These recommendations appeared in documents like the Oracle OCP 9i Database Performance Tuning Exam Guide and became “best practice” that people still follow decades later.
Myth #10: “optimizer_index_cost_adj Fixes Index-Shy Behavior”
The Popular Belief: The default CBO is “index-shy” or biased toward Full Table Scans. Lowering optimizer_index_cost_adj to 20 or 50 fixes this.
The Reality: The CBO is not naturally biased toward FTS. The cost formula for FTS using Multi-Block Reads is often highly optimized because sequential I/O is physically efficient.
The Death by Random I/O
optimizer_index_cost_adj (OICA) is a percentage applied to index access costs; the default is 100. A value of 10 means “multiply index access costs by 10%,” i.e. discount them by 90%.
When a DBA sets OICA to 10, they’re telling the CBO that reading a single block via an index is 10 times cheaper than reading a block during a table scan. This is physically impossible on spinning disk and highly unlikely even on Flash.
This artificial discount breaks Nested Loop Join costing. The optimizer selects Nested Loops for massive datasets where Hash Joins are superior. The query performs millions of “cheap” index lookups that cumulatively take hours—the classic “death by random I/O” profile.
The Correct Parameter
If the goal is to account for cached index blocks, use optimizer_index_caching instead. This parameter tells the CBO what percentage of index blocks are likely in the buffer cache, reducing the I/O component while leaving CPU costs intact. This avoids the catastrophic side effects on join method selection.
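A sketch of the safer adjustment (the value 75 is illustrative, not a recommendation):
-- Avoid: ALTER SYSTEM SET optimizer_index_cost_adj = 10;
-- Instead, tell the CBO roughly 75% of index blocks are likely cached:
ALTER SYSTEM SET optimizer_index_caching = 75 SCOPE = BOTH;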
Myth #11: “The CBO Uses All Available Information”
The Popular Belief: The optimizer considers everything relevant when making decisions.
The Reality: Several factors that seem obviously important are completely invisible to the optimizer.
What the CBO Cannot See
Migrated Rows: DBMS_STATS does not collect migrated row counts, so the CBO is “blissfully unaware” of row migration entirely. An index range scan on a table with 50% migrated rows will incur roughly double the I/O the CBO predicts.
Buffer Cache State: The optimizer assumes ALL I/O is physical. It doesn’t consider that hot blocks are likely cached, that the same block may be read multiple times, or that index root/branch blocks are almost always in memory.
Column Correlation: CBO assumes column values are independent. For predicates on two correlated columns, it multiplies selectivities:
WHERE state = 'California' AND city = 'San Francisco'
The CBO calculates: 0.05 × 0.001 = 0.00005
Reality: Every San Francisco is in California. True selectivity is ~0.001. The CBO expects almost zero rows but finds thousands. This underestimation ripples through the entire plan, causing the optimizer to choose Nested Loops expecting instant return, then grinding through thousands of unexpected rows.
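Extended Statistics (column groups) are the standard remedy, letting DBMS_STATS measure the combined distribution. A sketch against a hypothetical APP.CUSTOMERS table:
-- Create the column group; returns a system-generated extension name:
SELECT dbms_stats.create_extended_stats(
         ownname   => 'APP',
         tabname   => 'CUSTOMERS',
         extension => '(STATE, CITY)') AS ext_name
FROM   dual;

-- Subsequent gathers collect NDV (and a histogram if warranted) on the group:
EXEC dbms_stats.gather_table_stats('APP', 'CUSTOMERS')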
The Selectivity Calculation: Where Everything Goes Wrong
The accuracy of every execution plan depends entirely on the optimizer’s ability to estimate the fraction of rows (selectivity) that will satisfy predicates. This fraction, multiplied by total rows, yields Cardinality—which drives every decision about join methods and join orders.
The Join Cardinality Formula
Card_Join = Card_Outer × Card_Inner × Selectivity_Join
Where:
Selectivity_Join ≈ 1 / max(NDV_Outer, NDV_Inner)
This formula assumes the distinct values of the smaller column are a subset of the larger. When datasets are disjoint (keys don’t overlap), the CBO still calculates positive cardinality, expecting matches where none exist.
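A worked example with illustrative numbers—Card_Outer = 10,000, Card_Inner = 1,000, NDV_Outer = 100, NDV_Inner = 50:
Selectivity_Join ≈ 1 / max(100, 50) = 0.01
Card_Join = 10,000 × 1,000 × 0.01 = 100,000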
The Single Most Important Insight
Right cardinality equals right plan. Wrong cardinality equals wrong plan.
Every other technique—hints, parameters, statistics—exists to get that cardinality estimate closer to reality.
War Story: 10GB of I/O to Fetch One Row
A 4-node RAC running Oracle 19.12 had a trivially simple query causing massive load:
SELECT dbms_rowid.rowid_type(rowid) FROM <table> WHERE rownum = 1;
Expected: One multiblock read, return immediately.
Actual: 81,215 read requests, 10GB of I/O, 123 seconds elapsed.
The Root Cause
The optimizer chose an index fast full scan based on user_indexes.leaf_blocks showing a small value. But the actual index had grown to 13GB (versus a 4GB table) due to empty leaf blocks that statistics gathering couldn’t see.
The index fast full scan had to traverse massive empty space before finding actual data. Statistics on leaf_blocks don’t count empty leaf blocks.
Certain DML patterns—like a status column indexed for quick access to “new” values that cycle through statuses—can leave indexes in catastrophic states that statistics completely miss.
After index rebuild: 13GB reduced to 30MB.
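When a rebuild genuinely is warranted, as it was here, the commands are straightforward (the index name is hypothetical):
ALTER INDEX app.orders_status_idx REBUILD ONLINE;
-- A cheaper alternative that merges sparse leaf blocks in place:
ALTER INDEX app.orders_status_idx COALESCE;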
War Story: 44,000 Child Cursors for One SQL Statement
A production Oracle 11.2.0.3 system discovered a single SQL_ID with 44,217 different child cursors in the library cache.
The Destructive Feedback Loop
The query used 9 bind variables across 9 predicates, all on columns with histograms. Extended Cursor Sharing had gone haywire.
The v$sql_cs_selectivity view accumulated 17 million rows for this single statement. Each execution:
- Peeked at bind values
- Searched for matching selectivity ranges
- Found no match (BIND_EQUIV_FAILURE)
- Compiled a new plan
- Inserted a new selectivity range
This created exponential cursor proliferation.
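A quick way to spot this pathology, as a sketch:
-- Statements with the most child cursors:
SELECT sql_id, version_count FROM v$sqlarea ORDER BY version_count DESC;

-- Why a given statement's children were not shared (11.2+):
SELECT sql_id, child_number, bind_equiv_failure
FROM   v$sql_shared_cursor
WHERE  sql_id = '&sql_id';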
The fix: ALTER SYSTEM SET "_optimizer_extended_cursor_sharing_rel" = NONE (a hidden underscore parameter that disables Extended Cursor Sharing).
The lesson: Adaptive Cursor Sharing can represent a serious threat where disabling it becomes necessary.
War Story: Any Arithmetic Disables Bind Peeking
Jonathan Lewis documented a surprising blind spot (August 2025): any arithmetic operation on bind variables completely disables peeking for that predicate.
-- Bind peeking WORKS:
WHERE rownum <= :b1
-- Bind peeking FAILS:
WHERE rownum <= :b1 + :b2 -- CBO ignores peeked values!
WHERE rownum <= :b1 + 0 -- Even trivial arithmetic breaks it!
The Modern Pagination Trap
The OFFSET x ROWS FETCH NEXT y ROWS ONLY syntax internally rewrites to :offset + :fetch_size—disabling bind peeking entirely.
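The pattern in question, with hypothetical bind names:
SELECT *
FROM   orders
ORDER  BY order_id
OFFSET :page_start ROWS FETCH NEXT :page_size ROWS ONLY;
-- The internal rewrite produces :page_start + :page_size arithmetic,
-- so the peeked values never inform the cardinality estimate.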
This explains mysterious performance problems with paginated queries that seem like they should benefit from peeking.
The 10053 Trace: Reading the Mind of the Machine
To truly understand the CBO, one must leave the comfort of EXPLAIN PLAN and enter the raw data of the 10053 Trace. This trace dump is the CBO’s diary, recording every decision, formula, and rejected plan.
Key Sections
BASE STATISTICAL INFORMATION: Raw inputs (Rows, Blocks, AVG_ROW_LEN) before any predicates. The baseline reality.
SINGLE TABLE ACCESS PATH: Cost calculation for every possible index and Full Table Scan. Shows Adjusted Selectivity and Clustering Factor impact. If an index is rejected, this section explains why.
GENERAL PLANS: Join order and method rankings. Shows the CBO testing “Table A join Table B” versus “Table B join Table A,” calculating cardinality for each step.
PEEKED BINDS: Often the only place to definitively confirm whether Bind Peeking occurred. If a plan flips unexpectedly, the 10053 trace is the forensic tool to determine if an unlucky “peek” caused the regression.
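Generating the trace, as a sketch (the event fires only on a hard parse, so flush the cursor or change the SQL text if the statement is already cached):
ALTER SESSION SET EVENTS '10053 trace name context forever, level 1';
-- run the statement to be analyzed, then:
ALTER SESSION SET EVENTS '10053 trace name context off';
-- The file lands in the directory reported by v$diag_info ('Diag Trace').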
Oracle 12c’s Adaptive Features: Too Adaptive?
Oracle 12.1 introduced Adaptive Query Optimization—the biggest optimizer change since 10g. Plans could change mid-execution, switching between hash joins and nested loops on-the-fly based on actual row counts.
The Backpedaling
The results were mixed enough that Oracle made significant changes in 12.2:
| Feature | 12.1 Default | 12.2 Default |
|---|---|---|
| OPTIMIZER_ADAPTIVE_PLANS | TRUE | TRUE |
| OPTIMIZER_ADAPTIVE_STATISTICS | TRUE | FALSE |
| SQL Plan Directives (full) | Enabled | Limited |
Oracle found adaptive statistics features “more useful in environments where queries, data distributions and schemas are very complex.” For OLTP systems, the additional optimization time became significant relative to query runtime.
The features that were supposed to help were sometimes doing more harm than good.
How Oracle Differs from Other Databases
PostgreSQL deliberately has no native hints. The PostgreSQL community refuses to implement them, arguing hints encourage bad habits and don’t scale with data changes. The pg_hint_plan extension exists but isn’t part of core PostgreSQL.
Oracle’s runtime plan adaptation is unique. Oracle 12c+ can switch join methods mid-execution. Neither PostgreSQL nor MySQL has any equivalent. SQL Server introduced adaptive joins in 2017 with more limited scope.
Star transformation is Oracle-only. Oracle can automatically rewrite star schema queries, transforming dimension table joins into bitmap semijoin predicates. No equivalent exists in PostgreSQL, MySQL, or SQL Server.
Conclusion: The Deterministic Chaos
The Oracle CBO is not magic; it is a complex assembly of mathematical models attempting to simulate the physical reality of data retrieval. Its “secrets” are merely the boundary conditions of these models.
- Cost is Time, but time normalized by system statistics, and therefore only as trustworthy as those statistics
- The Clustering Factor is a geometric measure of index-order fragmentation, not necessarily a measure of physical retrieval efficiency in buffered environments
- Sampling (the 5500-row result) is a statistically grounded shortcut that, in modern versions, evolved into a full scan with approximate hash-based counting
- Selectivity formulas are brittle, breaking down under column correlation and skew, requiring interventions like Extended Statistics and Histograms
Understanding these internals moves the database professional from “knob-turner”—guessing at parameter values—to “systems architect” who aligns physical database design with the mathematical expectations of the Optimizer.
The CBO is a mirror; it reflects the accuracy of the statistics and the reality of the hardware. When it “fails,” it is usually because the reflection has been distorted by poor statistics, incorrect parameters, or a physical design that defies the model’s assumptions.
The single most important insight remains: Right cardinality equals right plan. Wrong cardinality equals wrong plan. Everything else is details.
Have you encountered any of these CBO behaviors in production? Share your war stories in the comments below.
References
- Jonathan Lewis, “Cost Based Oracle Fundamentals” (Apress, 2005)
- Jonathan Lewis, Oracle Scratchpad Blog – jonathanlewis.wordpress.com
- Richard Foote’s Oracle Blog – richardfoote.wordpress.com
- Tanel Poder, “7 Levels of Hint Invalidity” – tanelpoder.com
- Oracle Optimizer Blog – blogs.oracle.com/optimizer
- Tom Kyte, Ask Tom – asktom.oracle.com
- Mohamed Houri’s Oracle Notes – hourim.wordpress.com
- ORACLE-BASE – oracle-base.com

