The Databricks Runtime has been highly optimized by the original creators of Apache Spark. The significant increase in performance enables data processing and pipeline use cases that were not previously practical, and improves data team productivity.
The runtime leverages auto-scaling compute and storage to control infrastructure costs: clusters start and terminate intelligently, and the high price-to-performance ratio reduces overall infrastructure spend.
Databricks has wrapped Spark with a suite of integrated services for automation and management to make it easier for data teams to build and manage pipelines, while giving IT teams administrative control.
Caching: Copies of remote files are cached in local storage using a fast intermediate data format, so that successive reads of the same data are significantly faster.
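The idea can be illustrated with a minimal sketch in plain Python: a local cache sits in front of a slow remote store, so the first read pays the remote cost and repeat reads are served locally. All names here (`remote_read`, `LocalCache`) are illustrative, not Databricks APIs.

```python
import time

# Simulated remote object store (stand-in for cloud storage).
REMOTE = {"part-0001.parquet": b"col_a,col_b\n1,2\n"}

def remote_read(path):
    """Simulate a slow remote read."""
    time.sleep(0.01)  # stand-in for network latency
    return REMOTE[path]

class LocalCache:
    """Keep local copies of remote files so repeat reads skip the network."""
    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def read(self, path):
        if path in self._store:
            self.hits += 1          # served from local storage
            return self._store[path]
        self.misses += 1
        data = remote_read(path)
        self._store[path] = data    # keep a local copy for next time
        return data

cache = LocalCache()
cache.read("part-0001.parquet")  # first read: fetched remotely
cache.read("part-0001.parquet")  # second read: served from cache
```

The real Databricks cache additionally stores the data in a decoded intermediate format, which is what makes repeat scans faster rather than merely local.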
Z-Order Clustering: The colocation of related information in the same set of files dramatically reduces the amount of data that needs to be read, resulting in faster query responses.
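In Delta Lake this is surfaced as `OPTIMIZE ... ZORDER BY (col, ...)`; under the hood, rows are ordered along a space-filling curve so that rows close in multiple columns land in the same files. As a minimal sketch (plain Python, one common curve: a Morton code that interleaves the bits of two column values):

```python
def z_order_key(x, y, bits=8):
    """Interleave the bits of two column values into one Morton key.
    Rows that are close in (x, y) get close keys, so sorting by the
    key colocates related rows in the same set of files."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x bits -> even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits -> odd positions
    return key

rows = [(3, 7), (3, 6), (120, 9), (2, 7)]
rows.sort(key=lambda r: z_order_key(*r))
# Neighbouring (x, y) pairs now sit next to each other in sort order,
# so a range predicate on either column touches fewer files.
```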
Join Optimizations: Significant performance gains are possible with range join and skew join optimizations, which the engine applies based on the query pattern or on explicit skew hints.
Data Skipping: Minimum and maximum values are collected automatically as data is written, then used at query time to skip data that cannot match a query's predicates, returning results faster.
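A minimal sketch of the mechanism, in plain Python with illustrative names: each file carries min/max statistics recorded at write time, and at query time only files whose range overlaps the predicate are read at all.

```python
# Per-file statistics collected when the data was written.
file_stats = [
    {"file": "part-0001", "min": 1,   "max": 100},
    {"file": "part-0002", "min": 101, "max": 200},
    {"file": "part-0003", "min": 201, "max": 300},
]

def files_to_scan(stats, lo, hi):
    """Return only the files whose [min, max] range overlaps [lo, hi];
    all other files are skipped without being opened."""
    return [s["file"] for s in stats if s["max"] >= lo and s["min"] <= hi]

files_to_scan(file_stats, 150, 160)  # only part-0002 needs to be read
```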
Easy-to-Use Cluster Management: A user-friendly interface simplifies creating, restarting, and terminating clusters, and provides greater visibility into your clusters for easier manageability and cost control.
High Availability: The Databricks cluster manager transparently relaunches any worker instance that is revoked or crashes, ensuring your service is always up and running without the need to manage it yourself.
Elastic On-Demand Clusters: Build on-demand clusters in minutes with a few clicks, and scale up or down based on your current needs. Reconfigure or reuse resources as needs change for your team or service.
Backward Compatibility with Automatic Upgrades: Choose the version of Spark you want to use, ensuring legacy jobs can continue to run on previous versions, while you get the latest version of Spark hassle-free.
Flexible Scheduler: Execute production pipeline jobs on a specified schedule, from minute-level to monthly intervals and across time zones, with support for cron syntax and relaunch policies.
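As a sketch of what such a schedule looks like: Databricks job schedules pair a Quartz-style cron expression with a time zone. The payload below follows the shape of the Jobs API's schedule object, but the exact field names should be treated as an assumption and checked against current documentation.

```python
# Illustrative job-schedule payload (field names are an assumption
# modeled on the Databricks Jobs API, not a verified contract).
schedule = {
    # Quartz cron: seconds, minutes, hours, day-of-month, month, day-of-week.
    "quartz_cron_expression": "0 30 7 * * ?",  # every day at 07:30
    "timezone_id": "America/Los_Angeles",
    "pause_status": "UNPAUSED",
}
```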
Notifications: Notify a set of users whenever a production job starts, fails, and/or completes, with zero human intervention, through email or third-party pager integration, for peace of mind.
Flexible Job Types: Run different types of jobs to meet your different use cases, including notebooks, Spark JARs, custom Spark libraries and applications.
Optimized Data Sources: A central repository for your Spark data sources, with broad support including SQL, NoSQL, columnar and document stores, UDFs, file stores, file formats, search engines, and more.
The Databricks Runtime implements the open Apache Spark APIs with a highly optimized execution engine, which provides significant performance gains compared to standard open source Apache Spark found on other cloud Spark platforms. This core engine is then wrapped with additional services for developer productivity and enterprise governance.