Guest estera-kot Posted December 14, 2022

We are excited to announce the preview availability of Apache Spark™ 3.3 on Synapse Analytics. The essential changes include features that come from upgrading Apache Spark to version 3.3.1 and Delta Lake to version 2.1.0. Check out the official release notes for Apache Spark 3.3.0 and Apache Spark 3.3.1 for the complete list of fixes and features. In addition, review the migration guidelines between Spark 3.2 and 3.3 to assess potential changes to your applications, jobs and notebooks.

Below is an extended summary of the key new features we are highlighting in this article.

Related to Apache Spark versions 3.3.0 and 3.3.1

- Row-level runtime filtering: improves the performance of joins by pre-filtering one side of a join using a Bloom filter and an IN predicate generated from the values on the other side of the join (SPARK-32268).
- Improved compatibility of Spark with the SQL standard: ANSI enhancements (SPARK-38860).
- Error message improvements, to identify problems faster and take the necessary steps to resolve them (SPARK-38781).
- Support for complex types in the Parquet vectorized reader. Previously, the Parquet vectorized reader did not support nested column types (struct, array and map). Spark 3.3 adds nested column support to the vectorized reader, which improves performance compared to the non-vectorized reader when reading nested columns, and can also improve performance for non-nested columns when nested and non-nested columns are read together in one query (SPARK-34863).
- Hidden file metadata columns: users can query the metadata of the input files for all file formats. The metadata (e.g. file path, file name) is exposed as built-in hidden columns, meaning users only see it when they explicitly reference it (SPARK-37273); see the sketch below.
- A profiler for Python/Pandas UDFs (SPARK-37443).
- Trigger.AvailableNow for streaming queries. Previously, streaming queries with Trigger.Once loaded all of the available data in a single batch; because of this, the amount of data the queries could process was limited, or the Spark driver could run out of memory. Trigger.AvailableNow runs streaming queries like Trigger.Once but in multiple batches (SPARK-36533); see the sketch below.
- More comprehensive DS V2 push down capabilities (SPARK-38788).
- Executor rolling in Kubernetes environments (SPARK-37810).
- Support for customized Kubernetes schedulers (SPARK-36057).
- Migration from Log4j 1 to Log4j 2 (SPARK-37814), to gain in:
  - Performance: Log4j 2 is much faster than Log4j 1 and supports asynchronous logging, which can improve performance significantly.
  - Flexibility: Log4j 2 provides more flexibility in terms of configuration. It supports multiple configuration formats, including XML, JSON, and YAML.
  - Extensibility: Log4j 2 is designed to be extensible. It allows developers to create custom plugins and appenders to extend the functionality of the logging framework.
  - Security: Log4j 2 provides better security features than Log4j 1. It supports encryption and secure socket layers for secure communication between applications.
  - Simplicity: Log4j 2 is simpler to use than Log4j 1. It has a more intuitive API and a simpler configuration process.
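As a quick illustration of two of the items above, the hidden file metadata columns and Trigger.AvailableNow, here is a minimal PySpark sketch. The input path, checkpoint location and output path are placeholders for illustration only and are not part of the release notes.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hidden file metadata columns (SPARK-37273): only returned when referenced explicitly.
    df = spark.read.parquet("/data/events")  # placeholder input path
    df.select("*", "_metadata.file_path", "_metadata.file_name").show()

    # Trigger.AvailableNow (SPARK-36533): process everything currently available,
    # like Trigger.Once, but split into multiple batches instead of one large batch.
    query = (
        spark.readStream
        .schema(df.schema)                      # file streams need an explicit schema
        .parquet("/data/events")                # placeholder input path
        .writeStream
        .trigger(availableNow=True)
        .option("checkpointLocation", "/tmp/checkpoints/events")  # placeholder
        .start("/data/events_copy")             # placeholder output path
    )
    query.awaitTermination()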
Additional changes in Apache Spark 3.3.0 and 3.3.1:

- Introduce shuffle on SinglePartition to improve parallelism and fix a performance regression for joins in Spark 3.3 vs Spark 3.2 (SPARK-40703).
- Optimize the TransposeWindow rule to extend the applicable cases and improve its time complexity (SPARK-38034).
- Support timestamps in seconds for time travel using DataFrame options, for parity between doing time travel via SQL and via DataFrame options (SPARK-39633).
- Optimize global Sort to RepartitionByExpression to save a local sort (SPARK-39911).
- Ensure the output partitioning is user-specified in AQE (SPARK-39915).
- Update the Parquet V2 columnar check for nested fields (SPARK-39951).
- Fix reading a Parquet file partitioned on disk by a `Byte`-type column (SPARK-40212).
- Fix column pruning in CSV when _corrupt_record is selected (SPARK-40468).
- Developer examples for the pandas API on Spark, such as the new distributed-sequence default index, merge_asof, and datetime support. For example:

    import pyspark.pandas as ps

    ps.options.compute.default_index_type  # returns 'distributed-sequence'

Related to Delta Lake version 2.1.0

- Support for Apache Spark 3.3.
- Support for [TIMESTAMP | VERSION] AS OF in SQL. With Spark 3.3, Delta now supports time travel in SQL to query older data easily. With this update, time travel is available both in SQL and through the DataFrame API (see the sketch after this list).
- Support for Trigger.AvailableNow when streaming from a Delta table. Spark 3.3 introduces Trigger.AvailableNow for running streaming queries like Trigger.Once in multiple batches. This is now supported when using Delta tables as a streaming source.
- Support for SHOW COLUMNS to return the list of columns in a table.
- Support for DESCRIBE DETAIL in the Scala and Python DeltaTable API. Retrieve detailed information about a Delta table using the DeltaTable API and in SQL.
- Support for returning operation metrics from SQL Delete, Merge, and Update commands. Previously these SQL commands returned an empty DataFrame; now they return a DataFrame with useful metrics about the operation performed.
- Optimize performance improvements:
  - Added a config to use repartition(1) instead of coalesce(1) in Optimize, for better performance when compacting many small files.
  - Improved Optimize performance by using a queue-based approach to parallelize the compaction jobs.
- Other notable changes:
  - Support for using variables in the VACUUM and OPTIMIZE SQL commands.
  - Improvements for CONVERT TO DELTA with catalog tables: autofill the partition schema from the catalog when it is not provided, and use partition information from the catalog to find the data files to commit instead of doing a full directory scan. Instead of committing all data files in the table directory, only data files under the directories of active partitions will be committed.
  - Support for Change Data Feed (CDF) batch reads on column mapping enabled tables when DROP COLUMN and RENAME COLUMN have not been used. See the documentation for more details.
  - Improved Update performance by enabling schema pruning in the first pass.
  - Fix for DeltaTableBuilder to preserve the table property case of non-Delta properties when setting properties.
  - Fix for duplicate CDF row output for delete-when-matched merges with multiple matches.
  - Fix for consistent timestamps in a MERGE command.
  - Fix for incorrect operation metrics for DataFrame writes with a replaceWhere option.
  - Fix for a bug in Merge that sometimes caused empty files to be committed to the table.
  - Change in the log4j properties file format. Apache Spark upgraded log4j from version 1.x to 2.x, which uses a different format for the log4j configuration file. Refer to the Spark upgrade notes.
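As a quick illustration of the SQL time travel, SHOW COLUMNS, and DESCRIBE DETAIL items above, here is a minimal sketch against Delta Lake 2.1.0. It assumes an existing SparkSession (spark) with Delta Lake configured, for example on a Synapse Spark 3.3 pool; the table name "events" is a placeholder.

    from delta.tables import DeltaTable

    # SQL time travel: query an older snapshot of the table by version or by timestamp.
    spark.sql("SELECT * FROM events VERSION AS OF 0").show()
    spark.sql("SELECT * FROM events TIMESTAMP AS OF '2022-12-01'").show()

    # SHOW COLUMNS returns the list of columns in the table.
    spark.sql("SHOW COLUMNS IN events").show()

    # DESCRIBE DETAIL through the Python DeltaTable API.
    details = DeltaTable.forName(spark, "events").detail()
    details.select("format", "numFiles", "sizeInBytes").show()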
To learn more, review the full Delta Lake 2.1.0 release notes. For detailed contents and lifecycle information on all runtimes, you can visit Apache Spark runtimes in Azure Synapse. Please note that the team is working on upgrading some major component versions during the public preview, before announcing GA. This upgrade will ensure that the latest versions of these components are available for use, and that any potential bugs or security issues are addressed.