
Understanding data sources in Spark applications
Spark can connect to many different data sources, including files and both SQL and NoSQL databases. Some of the more popular data sources include file formats (CSV, JSON, Parquet, Avro), MySQL, MongoDB, HBase, and Cassandra.
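As a minimal sketch of what this connectivity looks like through the DataFrameReader API (the file paths, JDBC URL, table name, and credentials below are placeholders, and a MySQL JDBC driver must be available on the classpath):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DataSourcesExample")
  .getOrCreate()

// Built-in file formats are read directly through the DataFrameReader API.
val parquetDF = spark.read.parquet("hdfs://namenode:9000/data/events.parquet")
val jsonDF = spark.read.json("/data/events.json")

// Relational databases such as MySQL are accessed over JDBC.
// The URL, table, and credentials here are placeholders.
val mysqlDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/sales")
  .option("dbtable", "transactions")
  .option("user", "spark")
  .option("password", "****")
  .load()
```

Whatever the source, the result is a DataFrame, so downstream processing is identical regardless of where the data came from.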

In addition, Spark can connect to special-purpose engines and data sources, such as Elasticsearch, Apache Kafka, and Redis. These engines enable specific functionality in Spark applications, such as search, streaming, caching, and so on. For example, Redis enables the deployment of cached machine learning models in high-performance applications. We discuss Redis-based application deployment further in Chapter 12, Spark SQL in Large-Scale Application Architectures. Kafka is extremely popular in Spark streaming applications, and we will cover Kafka-based streaming applications in more detail in Chapter 5, Using Spark SQL in Streaming Applications, and Chapter 12, Spark SQL in Large-Scale Application Architectures. The DataSource API enables Spark connectivity to a wide variety of data sources, including custom data sources.
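As a preview of the Kafka integration covered in Chapter 5, here is a hedged sketch of reading a Kafka topic with Structured Streaming. The broker address and topic name are placeholders, and the spark-sql-kafka connector package must be on the classpath:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("KafkaStreamExample")
  .getOrCreate()

// Subscribe to a Kafka topic as a streaming DataFrame.
// The broker address and topic name below are placeholders.
val kafkaDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .load()

// Kafka records arrive as binary key/value columns; cast them to strings.
val messages = kafkaDF.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

// Write the stream to the console for inspection.
val query = messages.writeStream
  .format("console")
  .start()

query.awaitTermination()
```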
In Chapter 1, Getting Started with Spark SQL, we used CSV and JSON files on our filesystem as input data sources and queried them with SQL; the sketch below recaps that pattern. However, using Spark SQL to query data residing in files is not a replacement for using databases. Initially, some people used HDFS as a data source because of its simplicity and the ease of querying such data with Spark SQL. However, execution performance can vary significantly depending on the queries being executed and the nature of the workloads. Architects and developers need to understand which data stores best meet their processing requirements. We discuss some high-level considerations for selecting Spark data sources below.
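A minimal recap of that file-based querying pattern (the file path, reader options, and column names are illustrative, not from Chapter 1 verbatim):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("FileQueryExample")
  .getOrCreate()

// Read a CSV file, letting Spark infer the schema from the data.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/customers.csv")

// Register the DataFrame as a temporary view and query it with SQL.
df.createOrReplaceTempView("customers")

spark.sql("SELECT city, COUNT(*) AS cnt FROM customers GROUP BY city").show()
```

This convenience is exactly what makes file-backed sources attractive for ad hoc analysis, even though, as noted above, it does not make them a substitute for a database under every workload.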