Phase 1: The MaterializedMySQL Trap
MySQL is an outstanding transactional database, but it struggles badly under heavy analytical queries. Moving these workloads to ClickHouse is the definitive solution. However, if you read older migration guides from popular database vendors, they will almost universally instruct you to use the MaterializedMySQL engine.
Do not execute those commands. The ClickHouse team officially deprecated and removed the MaterializedMySQL engine in version 24.12; it was always experimental and fundamentally flawed at scale. The true enterprise standard for achieving zero-downtime replication is Change Data Capture, commonly referred to as CDC.
Phase 2: Network Latency and SaaS Economics
Many modern tutorials suggest using fully managed SaaS platforms like Confluent Cloud or ClickPipes to handle your CDC streaming. While these tools are convenient, they introduce a massive financial trap: when you sync terabytes of operational data across cloud regions, public providers charge astronomical network egress fees.
Furthermore, change data capture is highly sensitive to network latency. If your primary MySQL database is located in North America, hosting your open-source Redpanda and ClickHouse architecture on dedicated servers in the USA ensures sub-millisecond communication between the pipeline components. This localized bare-metal approach eliminates replication lag during peak transactional hours while completely avoiding per-gigabyte cloud billing shocks.
Phase 3: Advanced Schema Mapping and the Initial Snapshot
Before activating the live stream, we must copy the historical data. The biggest mistake engineers make here is assuming basic data types map perfectly. In production environments you must handle NULL values, financial decimals, and timezones meticulously.
Create the destination table manually first, mapping each MySQL data type to the appropriate ClickHouse type. Once it exists, use the native mysql() table function to pull the data at maximum speed.
-- Creating a production-ready ClickHouse schema
CREATE TABLE orders_analytics (
order_id UInt64,
customer_name Nullable(String), -- Handling MySQL NULLs
amount Decimal(10, 2), -- Financial precision
status Enum8('PENDING' = 1, 'PAID' = 2), -- Strict enumerations
created_at DateTime('UTC') -- Timezone awareness
) ENGINE = MergeTree()
ORDER BY order_id;
-- Execute the high-speed initial data copy
INSERT INTO orders_analytics
SELECT * FROM mysql('10.0.0.5:3306', 'prod_db', 'orders', 'user', 'pass');
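Once the copy finishes, sanity check it before moving on. A quick row-count comparison, reusing the same placeholder connection details as above, catches a truncated snapshot early:
-- Compare the snapshot row count against the source
SELECT count() FROM orders_analytics;
SELECT count() FROM mysql('10.0.0.5:3306', 'prod_db', 'orders', 'user', 'pass');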
Phase 4: The 2026 CDC Streaming Pipeline
To capture live transactions we use Debezium, which reads the MySQL binary log and runs as a Kafka Connect source connector, pushing every change event to an event-streaming broker.
The Kafka vs Redpanda reality: Apache Kafka is the battle-tested enterprise standard with a massive ecosystem, and you can absolutely use it. However, running JVM-based brokers can be resource heavy. For bare-metal NVMe servers we often recommend Redpanda, a drop-in, Kafka-API-compatible C++ alternative, for simpler operations and lower latency. Both work perfectly well for this pipeline.
# Example Debezium connector configuration (Debezium 2.x+ property names) pushing to your broker
{
  "name": "mysql-clickhouse-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "10.0.0.5",
    "database.port": "3306",
    "database.user": "debezium_user",
    "database.password": "debezium_pass",
    "database.server.id": "184054",
    "topic.prefix": "prod_db",
    "database.include.list": "prod_db",
    "table.include.list": "prod_db.orders",
    "schema.history.internal.kafka.bootstrap.servers": "broker_host:9092",
    "schema.history.internal.kafka.topic": "schema-changes.orders"
  }
}
Phase 5: The Missing Ingestion Layer
Many tutorials skip a critical step: how does data actually flow from the Kafka topic into the ClickHouse storage table? You need an ingestion layer. ClickHouse provides a native Kafka table engine that consumes the message stream and a materialized view that routes those messages into your final analytical table. The example below assumes the Debezium envelope has been flattened into plain JSON rows (for instance with the ExtractNewRecordState transform), that the topic name matches your connector's topic prefix, and that the operation type is surfaced as a field, shown here as op_type.
-- 1. Create the Kafka Engine Consumer
CREATE TABLE orders_kafka_queue (
order_id UInt64,
amount Decimal(10, 2),
status String,
op_type String -- Debezium operation type ('c' = create, 'u' = update, 'd' = delete)
) ENGINE = Kafka()
SETTINGS kafka_broker_list = 'broker_host:9092',
kafka_topic_list = 'prod_db.orders',
kafka_group_name = 'clickhouse_consumer',
kafka_format = 'JSONEachRow';
-- 2. Route data to the final table
CREATE MATERIALIZED VIEW orders_mv TO orders_analytics_final AS
SELECT order_id, amount, status,
if(op_type = 'd', 1, 0) AS is_deleted,
now() AS updated_at
FROM orders_kafka_queue;
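Note that the materialized view writes into orders_analytics_final, which must already exist when the view is created. Phase 6 relies on that table being a ReplacingMergeTree carrying a deleted flag, so here is a minimal sketch, assuming updated_at serves as the version column:
-- Destination table for the CDC stream (merge semantics discussed in Phase 6)
CREATE TABLE orders_analytics_final (
order_id UInt64,
amount Decimal(10, 2),
status String,
is_deleted UInt8, -- set to 1 when Debezium reports a delete
updated_at DateTime -- version column: ReplacingMergeTree keeps the row with the highest value
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY order_id;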
Phase 6: Tombstones, the FINAL Trap, and the Storage Tax
ClickHouse storage is append-oriented: rows are never updated in place. When Debezium detects a deleted row in MySQL, it emits a delete event. To process this we use the ReplacingMergeTree engine with a deleted flag. However, this introduces two massive production challenges.
- The Storage Tax: ReplacingMergeTree does not delete superseded rows immediately; it waits for an unscheduled background merge, which causes storage amplification. To manage this, schedule an OPTIMIZE TABLE orders_analytics_final FINAL command during off-peak night hours to force a cleanup.
- The FINAL Trap: Many blogs tell you to add the FINAL keyword to your SELECT queries to get the latest row. Do not do this. It forces deduplication at query time and causes massive CPU spikes. Instead, use the argMax aggregate function to fetch the latest state efficiently.
-- The Enterprise way to query updated records without the FINAL keyword
SELECT
order_id,
argMax(amount, updated_at) AS latest_amount,
argMax(status, updated_at) AS latest_status
FROM orders_analytics_final
GROUP BY order_id
HAVING argMax(is_deleted, updated_at) = 0;
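If your dashboards run this pattern constantly, one convenient option, sketched here reusing the table and column names above, is to wrap it in a view so application queries stay simple:
-- Latest state per order, exposed under a stable name for dashboards
CREATE VIEW orders_current AS
SELECT
order_id,
argMax(amount, updated_at) AS latest_amount,
argMax(status, updated_at) AS latest_status
FROM orders_analytics_final
GROUP BY order_id
HAVING argMax(is_deleted, updated_at) = 0;
Dashboards can then run SELECT * FROM orders_current WHERE latest_status = 'PAID' without knowing anything about versions or tombstones.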
Phase 7: Fault Tolerance and Cutover
Before routing live traffic, ensure your pipeline is fault tolerant. Configure a dead letter queue in your Kafka or Redpanda broker to catch schema-mismatch errors, and make sure your ClickHouse tables use ReplicatedReplacingMergeTree with at least two replicas spread across different bare-metal nodes.
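As a sketch of what that replication looks like in DDL, assuming a two-node cluster named analytics_cluster with the standard {shard} and {replica} macros configured on each server, the destination table from Phase 5 becomes:
-- Replicated variant of the destination table (one replica per bare metal node)
CREATE TABLE orders_analytics_final ON CLUSTER analytics_cluster (
order_id UInt64,
amount Decimal(10, 2),
status String,
is_deleted UInt8,
updated_at DateTime
) ENGINE = ReplicatedReplacingMergeTree(
'/clickhouse/tables/{shard}/orders_analytics_final',
'{replica}',
updated_at
)
ORDER BY order_id;
Keep in mind that replicated tables require ClickHouse Keeper (or ZooKeeper) to coordinate the replicas.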
Once verified, update your application code to route all heavy aggregations, dashboard requests, and report-generation queries to ClickHouse. Your MySQL database is now relieved of analytical strain, allowing it to focus purely on rapid transactional writes.