Access Control on Dataproc for Hive and Spark jobs

Christian Berzi
Google Cloud - Community
8 min read · Jun 21, 2023


Authors: myself (@christianberzi) and Yunus Durmuş (@yunusd)

Google Cloud Dataproc is a managed open source big data service that lets you run many of the popular big data engines. With Apache Spark, there is even a serverless option.

There are challenges, though, when you migrate from on-premises or other clouds, especially around access control. Unlike on-premises environments, where Kerberos is usually used for authentication and Ranger, Sentry, or HDFS ACLs for authorization, in Google Cloud IAM rules everything.

In this article, we talk about access control for Spark and Hive on Dataproc. In particular, we cover the following topics:

  • What are the basics of access control?
  • What options do we have on Dataproc for properly handling access control?
  • How do different personas interact with Dataproc, particularly a Data Scientist, Data Engineer and Data Analyst?

Introduction

Pillars of access control

When we talk about access control, we usually refer to at least the following 3 concepts: authentication, authorization and audit.

  • Authentication is about proving that you are who you say you are, usually by confirming your identity through the use of credentials. For example, when you log in to your Google Account, you provide your password and complete any two-factor authentication requirements to confirm that you are the account owner and not being impersonated by a malicious actor.
  • Authorization is the process of determining whether the user or application attempting to access a resource has been authorized for that level of access. For instance, let’s say you need to access a resource, such as a file on Google Cloud Storage. The authorization process ensures that your user has the right to access that file. Otherwise, an error will be issued.
  • Audit helps you answer questions like “Who did what? Where? When?”. For example, let’s say you were able to access the previously mentioned file on GCS because you were authorized to do so. Audit logs will then record this action, including which user accessed the file, when it happened, and other useful details.
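
To make the authorization pillar a bit more concrete, here is a minimal, hedged sketch that asks Cloud Storage which permissions the calling identity actually holds on a bucket (the bucket name is a placeholder):

```python
# Sketch: checking authorization on a Cloud Storage bucket.
# "my-data-bucket" is a hypothetical bucket name.
from google.cloud import storage

client = storage.Client()  # authenticates with Application Default Credentials
bucket = client.bucket("my-data-bucket")

# Ask GCS which of these permissions the authenticated principal really has.
granted = bucket.test_iam_permissions(
    ["storage.objects.get", "storage.objects.create"]
)

if "storage.objects.get" in granted:
    print("Authorized: you can read objects in this bucket.")
else:
    print("Not authorized: reading objects in this bucket will fail.")
```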

Access control on Dataproc: what is different from on-premises?

We’ve seen what access control generally means. Now, what’s the difference between on-premises and Cloud Hadoop environments when it comes to access control?

Differences between on-premises and Dataproc

As you can see from the above picture, on-premises, Kerberos is used for authentication and Ranger, Sentry, or HDFS ACLs are usually used for authorization. On the other hand, in Google Cloud (specifically on Dataproc), IAM and Cloud Identities are used both for user access to the Dataproc cluster and for user access to the data (i.e. Cloud Storage in the example). The Dataproc cluster itself also has its own credentials to read/write data that resides outside the cluster: usually the cluster’s service account credentials are used to authenticate and authorize such requests.
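
As an illustration of that last point, the hedged sketch below creates a Dataproc cluster with a dedicated service account using the google-cloud-dataproc client library; the project, cluster and service account names are placeholders:

```python
# Sketch: creating a Dataproc cluster that uses its own service account
# to read/write data outside the cluster. All names are hypothetical.
from google.cloud import dataproc_v1

project_id = "my-project"
region = "europe-west1"

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "etl-cluster",
    "config": {
        "gce_cluster_config": {
            # Identity the cluster uses towards Cloud Storage, BigQuery, etc.
            "service_account": "etl-cluster-sa@my-project.iam.gserviceaccount.com",
            "service_account_scopes": [
                "https://www.googleapis.com/auth/cloud-platform"
            ],
        }
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # wait for the cluster to be created
```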

Hadoop modernization migrations to the Cloud can be quite challenging due to these differences between on-premises and Cloud environments. Let’s see how we can leverage access control options of Dataproc to get the most out of our Hadoop environment.

Dataproc options for access control

Now the question is: what options do we have in Dataproc for access control?

Access control in Dataproc

As shown in the table above, there are two different scenarios to consider when it comes to Dataproc access control:

  1. Access to the cluster: in this scenario we simply want to control a user’s or application’s access to a Dataproc cluster in terms of cluster usage. For instance, you want to control who can SSH into the cluster master, who can submit jobs to the cluster, who can edit the cluster configuration, and so on.
  2. Cluster’s access to resources: in this second scenario, we want to control the cluster’s access to the backend resources, usually where the data resides (e.g. Google Cloud Storage or BigQuery). In other words, here we focus on what credentials and roles the cluster uses to access data that resides outside the cluster (see the sketch below).
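
To make the two scenarios concrete: access to the cluster (scenario 1) is typically controlled with Dataproc IAM roles such as roles/dataproc.editor granted to users on the project, while the cluster’s access to data (scenario 2) is controlled by the roles granted to its service account. The minimal sketch below, with placeholder names, grants a cluster service account read access to a Cloud Storage bucket:

```python
# Sketch: scenario 2 — grant the cluster's service account read access to the
# bucket where the data lives. Bucket and account names are hypothetical.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-data-bucket")

policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {
        "role": "roles/storage.objectViewer",
        "members": {
            "serviceAccount:etl-cluster-sa@my-project.iam.gserviceaccount.com"
        },
    }
)
bucket.set_iam_policy(policy)
```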

Going back to our question, Dataproc offers different options for the three pillars of access control in both scenarios. Some options are similar to how things work on-premises (e.g. Kerberos for authentication, or Ranger to define authorization policies against resources like HDFS or Hive). These options are useful if you want to reproduce as accurately as possible the access control setup you use on-premises.
Other options are Google Cloud native, because they leverage Google Cloud’s out-of-the-box access control features like Cloud Identities, IAM or Cloud Audit Logs. These options are useful if you want to modernize access control by taking advantage of Cloud features and get the most out of the Hadoop environment in the Cloud. We’ll see a mix of these options in this article.

What options should we use then? Well, it depends on your usage pattern and your business and technical requirements. So, let’s go through the most common options by following the specific needs of these three personas:

Personas

Data Scientist

Let’s start with the Data Scientist who primarily wants to use Jupyter Notebooks for data exploration and model development.

Given the Data Scientist requirements, the questions we want to answer are:

  1. What options do we have for running Jupyter Notebooks in Dataproc?
  2. What are the access control features of these options?

Let’s begin with the first question. In Dataproc, we basically have two options for using Jupyter Notebooks, also represented in the table below.

As shown in the table, you can use the Jupyter component both with a standard Dataproc cluster and with a cluster that has the “Personal Cluster Authentication” feature enabled. Personal Cluster Authentication allows interactive workloads on the cluster to run securely as the end user’s identity. This means that interactions with other Google Cloud resources such as Cloud Storage are authenticated as you (i.e. the user configured to use the cluster), rather than as the cluster service account.

To answer the second question, let’s understand how a Spark job running in the Notebook accesses other storage resources like Google Cloud Storage. As shown in the table, there is no difference between the first two columns because in both cases (i.e. Vertex AI and Dataproc Jupyter in a standard configuration) the cluster service account is used to authenticate and authorize requests from the cluster to, for instance, a GCS bucket. Audit logs follow the same logic: they contain the cluster’s service account as the principal of the data request.
On the other hand, the Personal Cluster Authentication feature allows Spark jobs to access other resources using the identity of the user account that has been set up to use the cluster. This also means that audit logs contain the user account as the principal of the data request, instead of the cluster service account.
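
For example, the following notebook cell (with a placeholder bucket path) looks identical in both setups; what changes is the principal used for the GCS request and recorded in the audit logs:

```python
# Sketch: a PySpark cell run from the Jupyter notebook on the cluster.
# The GCS path is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("notebook-exploration").getOrCreate()

# On a standard cluster (or a Vertex AI notebook backed by it), this read is
# authenticated and authorized as the cluster's service account, which is the
# principal recorded in the audit logs. With Personal Cluster Authentication
# enabled, the same read runs as your own user identity instead.
df = spark.read.parquet("gs://my-data-bucket/events/")
df.show(10)
```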

We recommend using Vertex AI notebooks with remote serverless Spark, where every Data Scientist has their own Spark. Serverless Spark autoscales, so you don’t need to share clusters for resource efficiency. For access control, each serverless Spark workload should be created with its own service account (see the sketch after the list below). The service account can have:

  • a team’s privileges
  • an individual data scientist’s privileges (this is also called a shadow SA)
  • a use case’s privileges (recommended).
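
As a rough illustration of the recommended option, the hedged sketch below submits a Dataproc Serverless (serverless Spark) batch that runs with a use-case-specific service account; interactive sessions for notebooks expose a similar execution config. All names are placeholders:

```python
# Sketch: a serverless Spark batch running under a use-case-specific
# service account. Project, bucket and account names are hypothetical.
from google.cloud import dataproc_v1

project_id = "my-project"
region = "europe-west1"

batch_client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = {
    "pyspark_batch": {
        "main_python_file_uri": "gs://my-code-bucket/notebooks/churn_features.py"
    },
    "environment_config": {
        "execution_config": {
            # The identity (and therefore the data privileges) of this workload.
            "service_account": "churn-usecase-sa@my-project.iam.gserviceaccount.com"
        }
    },
}

operation = batch_client.create_batch(
    parent=f"projects/{project_id}/locations/{region}",
    batch=batch,
    batch_id="churn-features-001",
)
operation.result()  # wait for the batch to finish
```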

Data Engineer

A Data Engineer handles data ingestion and transformation. They basically own the ETL/ELT pipelines. They don’t need, and should not have, fine-grained access rights to production data, such as row- and column-level permissions. They simply create transformation jobs with Spark, Hive, Presto, etc. and then orchestrate them. Mostly, they perform detailed tests in development and then run these transformations in production through orchestration.

Data Engineers interact with Dataproc clusters by using the Dataproc APIs. They mainly orchestrate ETL/ELT pipelines by using other tools like Cloud Composer and Workflows.

So life is easier for the Data Engineer. They can create purpose-built clusters, use them, and then dispose of them immediately. As shown in the above figure, for each job type and profile, a Data Engineer uses a different ephemeral Dataproc cluster with an optimized profile of CPU, memory, autoscaling and tool versions like Spark 2 or 3. Orchestration tools such as Composer and Workflows handle the entire lifecycle of these clusters. For access control isolation, clusters have different access rights to backend data; these rights are determined by the service account assigned to them. Finally, all the API calls to these clusters are stored in audit logs.
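
As a sketch of this pattern, the hypothetical Cloud Composer (Airflow) DAG below creates an ephemeral cluster with its own service account, runs a single PySpark job on it and deletes the cluster right away; all names and machine sizes are placeholders:

```python
# Sketch: an ephemeral, purpose-built Dataproc cluster orchestrated by
# Cloud Composer (Airflow). All names and sizes are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)

PROJECT_ID = "my-project"
REGION = "europe-west1"
CLUSTER_NAME = "etl-sales-spark3"

CLUSTER_CONFIG = {
    "master_config": {"num_instances": 1, "machine_type_uri": "n2-standard-4"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n2-standard-4"},
    "gce_cluster_config": {
        # The ephemeral cluster's identity towards the backend data.
        "service_account": "etl-sales-sa@my-project.iam.gserviceaccount.com"
    },
    "software_config": {"image_version": "2.1-debian11"},  # Spark 3 image
}

PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {
        "main_python_file_uri": "gs://my-code-bucket/jobs/transform_sales.py"
    },
}

with DAG("sales_etl", start_date=datetime(2023, 6, 1), catchup=False) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config=CLUSTER_CONFIG,
    )

    run_transformation = DataprocSubmitJobOperator(
        task_id="run_transformation",
        project_id=PROJECT_ID,
        region=REGION,
        job=PYSPARK_JOB,
    )

    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule="all_done",  # dispose of the cluster even if the job fails
    )

    create_cluster >> run_transformation >> delete_cluster
```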

Data Analyst

A Data Analyst accesses production data with fine-grained rights, such as table-, row- and column-based restrictions. They analyze the data and present it with dashboards. Native Dataproc IAM does not allow such fine-grained access control. So what should we do?

Let’s see what is available in an on-premises Hadoop deployment and what we can do in the Cloud.

  • Hive: the Apache Ranger Hive plugin handles fine-grained access control. On Google Cloud, we can again use Apache Ranger directly. There is a step-by-step guide from my colleague David explaining this setup.
  • Spark: Spark has no open source Ranger plugin. The reason is that Spark could circumvent it: the Spark driver accesses data directly and is controlled by the end user, so an end user could easily bypass Ranger.
    BigLake solves this problem. The BigLake API handles fine-grained access control for data stored on GCS, so you actually improve the authorization of your Spark jobs when you migrate to Google Cloud (see the sketch after this list).
    – Even with BigLake, the service account of a cluster is used for IAM. So when multiple users access data via the same cluster, they all have the same privileges on the data because of the shared service account. The solution is to use serverless Spark, where each user/use case/team has their own cluster with a different service account.
  • SparkSQL as a distributed SQL engine: Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface. In this mode, end-users or applications can interact with Spark SQL directly to run SQL queries, without the need to write any code.
    – This architecture can use Apache Ranger since it is based on HiveServer2 and the end user does not run the Spark driver itself. The setup is the same as normal Hive as explained in this guide.
    – Also, we may still use BigLake.
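
To give an idea of what the BigLake path looks like from a Spark job, here is a minimal PySpark sketch (the table name is a placeholder) that reads a BigLake table through the Spark BigQuery connector, so row- and column-level policies can be enforced server-side for the job’s identity:

```python
# Sketch: reading a BigLake table from Spark. The table name is hypothetical.
# On Dataproc Serverless the BigQuery connector is preinstalled; on a cluster
# you may need to add the spark-bigquery connector jar.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("biglake-read").getOrCreate()

# The read goes through the BigQuery APIs, so fine-grained (row/column-level)
# policies are enforced for the identity running this job, typically the
# serverless batch's service account.
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.sales_lake.orders")
    .load()
)

df.show(10)
```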

In all the above cases, the common theme is having multiple clusters, even job-scoped ones, unlike the large shared clusters of on-premises deployments, and then granting access to these ephemeral clusters per use case instead of controlling everything at a single point.
