Small cluster instances for reading files
A big pain point of Spark/Databricks is reading millions of small files, which is unfortunately a common scenario. It's possible to read them by spinning up many workers at once, but that's also quite expensive. It would be great to have a cluster type with very little RAM just for reading all those files. Afterwards, a coalesce can pack them into bigger Parquet files for further processing with a larger cluster and fewer worker nodes.
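For reference, the compaction step could look roughly like this in PySpark. This is just a minimal sketch: the paths, the input format (JSON), and the target file count of 64 are illustrative assumptions, not recommendations.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("small-file-compaction").getOrCreate()

# Read the millions of small files (JSON is an assumed format here;
# the path is hypothetical).
df = spark.read.json("s3://my-bucket/raw/small-files/")

# Repack into a manageable number of larger Parquet files.
# 64 is an arbitrary choice and should be tuned to the total data volume.
df.coalesce(64).write.mode("overwrite").parquet("s3://my-bucket/compacted/")
```

Using `coalesce()` instead of `repartition()` avoids a full shuffle, which matters when reading the small files is already the bottleneck.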