Best explanation of delta parquet and its features that I've seen so far.
Sir, I commend your work, and I have all the best wishes in my heart for you. I was working as a junior data engineer at a company here in Canada; your teachings perfectly align with the skills that are required to shine in this field. As a junior, I always have so many questions about data, and your course addresses those above and beyond. Please don't stop making videos; we need you. If you can do some lectures on PySpark, that would be great.
You are the best teacher I have come across; your lectures have increased my interest in data lakes manyfold.
DP-203: 08 - Notes:
Delta Lake is a storage format, like Parquet or CSV
A data lake is the storage itself
Delta Lake is Parquet files + a transaction log (JSON)
Each update writes a new log file plus the Parquet files that were updated
- ACID guarantees!
  atomicity
    partitioning (subdirectories) like country=us, country=ua
    parallelism
  consistency
    state 1 to state 2
  isolation
    concurrency - optimistic model (in SQL, pessimistic by default)
    snapshot isolation
  durability
    redundant strategy
- history audit is possible (basic)
  > DESCRIBE HISTORY tableName
- time travel
  > SELECT * FROM tableName TIMESTAMP AS OF '2023-10-07T16:09:18.000+0000' -- shows the table as it was at this timestamp
  or
  > SELECT * FROM tableName VERSION AS OF 2
- easy rollback of changes (in case of accidental updates)
- vacuuming: removes uncommitted files and files that are no longer needed and are older than the retention period (default 7 days); running it limits how far back time travel can go
- schema enforcement: x Parquet, v Delta (Delta detects a schema mismatch; Parquet usually does not, which leads to data corruption)
- check constraints
  > ALTER TABLE tableName ADD CONSTRAINT GenderCheck CHECK (gender IN ('F','M'))
- schema evolution (the data's schema changes over time)
  > df.write.option("mergeSchema","true").mode("append").format("delta").save(...)
- merge (insert new rows / update existing rows by ID)
- OPTIMIZE and Z-ORDER: small files are combined into bigger ones; Z-ORDER controls how the data is physically ordered in the files
- unified batch and streaming (as a result, we can use the same code for both)
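To make the "Parquet files + transaction log (JSON)" idea from the notes a bit more concrete, here is a minimal toy sketch in plain Python. It is not the Delta Lake implementation: it only imitates the general shape of the log (numbered JSON commit files holding `add` and `remove` actions) and shows how replaying the log up to a given version gives a "VERSION AS OF" style snapshot. The helper names `commit`, `snapshot`, and `demo`, and the `part-000N.parquet` file names, are made up for illustration, and no real Parquet data is written.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def commit(log_dir: Path, version: int, actions: list) -> None:
    """Write one commit as newline-delimited JSON actions.

    Opening with mode "x" fails if the file already exists. That is a
    simplified stand-in for optimistic concurrency: the first writer to
    claim a version number wins, and a later writer gets an error.
    """
    path = log_dir / f"{version:020d}.json"
    with open(path, "x") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")


def snapshot(log_dir: Path, as_of_version: Optional[int] = None) -> set:
    """Replay commits in order and return the data files live at that version."""
    live = set()
    for path in sorted(log_dir.glob("*.json")):
        version = int(path.stem)
        if as_of_version is not None and version > as_of_version:
            break
        for line in path.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


def demo() -> dict:
    """Build a three-commit toy table and return the snapshot at each version."""
    with tempfile.TemporaryDirectory() as tmp:
        log_dir = Path(tmp) / "_delta_log"
        log_dir.mkdir()
        commit(log_dir, 0, [{"add": {"path": "part-0000.parquet"}}])
        commit(log_dir, 1, [{"add": {"path": "part-0001.parquet"}}])
        # An update rewrites part-0000: the old file is removed and the
        # rewritten file is added in the same commit, so readers see
        # either both changes or neither.
        commit(log_dir, 2, [
            {"remove": {"path": "part-0000.parquet"}},
            {"add": {"path": "part-0002.parquet"}},
        ])
        return {v: snapshot(log_dir, v) for v in range(3)}


if __name__ == "__main__":
    for version, files in demo().items():
        print(version, sorted(files))
```

In this toy model the "removed" file is only dropped from the log's view and would still sit on disk, which is why older versions stay readable until something like a vacuum physically deletes files that are no longer referenced.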
