This is one of a series of posts revolving around debugging stories from the field. The goal of this series is to demonstrate how to debug issues in a Hadoop application, as well as to show some of the decision-making process that goes into diagnosing issues. I encourage you to look past just the problem and the solution, and instead focus on the process. Feel free to post comments on possibilities you think we missed, or questions on why we went a certain direction, so that we can all learn from each other! All names are omitted in order to protect the innocent and not-so-innocent.
Background
A client has a medium-sized, Kerberized HDP 2.6.0 installation. Hive Interactive is disabled and Hive is set up to use HTTP transport mode. They reported that suddenly no Hive jobs were running successfully on one of their clusters, including metastore-only jobs like SHOW DATABASES as well as more complex queries. Through beeline, I was able to observe the following error. Similar errors were seen through Zeppelin and the Hive CLI.
The first thing we noticed is the 401 response code, which means unauthorized. We verified that the connection string was correct and used klist to confirm that the user had a valid Kerberos TGT to use with Hive. We also verified that no known configuration changes had been made that might have caused this issue.
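For reference, a connection to HiveServer2 in this configuration (Kerberos with HTTP transport) looks roughly like the sketch below, which mirrors what beeline does with the same connection string. The host, port, realm, and httpPath are placeholders, and a valid TGT in the ticket cache is assumed.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveHttpKerberosCheck {
    public static void main(String[] args) throws Exception {
        // Hive JDBC driver; requires the hive-jdbc (standalone) jar on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Service principal plus HTTP transport parameters; the values here
        // are placeholders, not the client's real configuration.
        String url = "jdbc:hive2://hs2.example.com:10001/default;"
                   + "principal=hive/_HOST@EXAMPLE.COM;"
                   + "transportMode=http;httpPath=cliservice";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             // The same metastore-only query that was failing for the client.
             ResultSet rs = stmt.executeQuery("SHOW DATABASES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```

While the cluster was in the failing state, a client like this received the 401 response regardless of which query it tried to run.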
With those possibilities ruled out, we checked the HiveServer2 log. In there, we found the following lines.
The error message "cannot locate default realm" at the bottom is troublesome. That's the sort of error you would expect to see when Kerberos is improperly configured, but this is a system that had been working for years with no issues. Since the default realm is normally read from the default_realm setting in krb5.conf, this pointed toward a problem with the Kerberos configuration file itself.
With this in hand, we searched for krb5.conf files on the HiveServer2 node itself. By a stroke of luck, we not only found the standard /etc/krb5.conf file, but also /home/hive/krb5.conf. The latter file was empty, and had been written at about the time the issues started to appear.
After renaming this file to temporarily "remove" it, we saw Hive queries start to work again. With this, we were confident that HiveServer2 was mistakenly using the krb5.conf file in its home directory, causing the root issue. Now the question was how that file got there to begin with, when no one but admins should have access to that directory.
Determining Root Cause
Looking closer at the empty krb5.conf file, we saw that it had been written by the hive user. Additionally, only the hive user (and root) had access to that directory. We then looked at the shell history for those two users to try to determine which of them created the file. Sadly, neither user appeared to have created the file through the command line.
Since we suspected it was the hive user that created this file (it was owned by the hive user, after all), we looked through the queries that had been run on the cluster around the time things failed. Thankfully, we came across the following query, which was run at about the time the jobs started failing.
Note that this query calls the UDF test_skey. Talking with the team that ran this query, we learned they were testing a function they hoped would allow them to authenticate as their service account and then perform some operations. This didn't work because the job ran as the hive user (since impersonation was turned off), and the hive user didn't have access to the keytab and krb5.conf files they tried to pull down.
To sum up, the immediate hypothesis was that the UDF ran as the hive user (since impersonation was disabled) and had its working directory set to /home/hive. When the UDF was called, it pulled down the krb5.conf file, which ended up empty, into /home/hive. The query failed, but the krb5.conf file remained. Then, as guessed earlier, the HiveServer2 instance tried to use this file instead of /etc/krb5.conf, causing the failures.
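We never saw the team's exact UDF code, but the pattern they described would look something like the sketch below. All class names, paths, and the principal are hypothetical; the important part is that the relative destination paths resolve against the working directory of whichever process evaluates the UDF.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.UserGroupInformation;

// Hypothetical reconstruction of the test_skey pattern, not the team's code.
public class TestSKeyUDF extends UDF {
    public Text evaluate(Text input) {
        try {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Relative destinations resolve against the evaluating process's
            // working directory -- /home/hive when the query is executed
            // locally by HiveServer2.
            fs.copyToLocalFile(new Path("/user/svc_account/krb5.conf"),
                               new Path("krb5.conf"));
            fs.copyToLocalFile(new Path("/user/svc_account/svc_account.keytab"),
                               new Path("svc_account.keytab"));

            // Point the JVM's Kerberos libraries at the downloaded config and
            // log in as the service account.
            System.setProperty("java.security.krb5.conf", "krb5.conf");
            UserGroupInformation.loginUserFromKeytab(
                "svc_account@EXAMPLE.COM", "svc_account.keytab");

            return new Text("authenticated as service account");
        } catch (Exception e) {
            // With impersonation off, this runs as the hive user, which cannot
            // read the service account's files, so the login fails -- but a
            // stray krb5.conf can still be left behind in the working directory.
            return new Text("failed: " + e.getMessage());
        }
    }
}
```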
Creating Files on the HiveServer2 Node
There was one last thing that was bugging me about this issue. The team and I had both expected the UDF to run on the data nodes, yet the empty file was only found on the HiveServer2 node. Why would the query run on the HiveServer2 node itself?
The answer lies in the optimizations Hive performs automatically. If a query is small enough, even one that reads data from a table, Hive may decide not to start a MapReduce or Tez job at all and instead execute the query locally on the HiveServer2 node. This appears to be true of any static, single-row query like the one used here. Even when some table data is read, if the result is limited to a single row, Hive may forgo launching a YARN job. I've created a repository that proves out this hypothesis here. In this way, it is actually possible for a UDF to run custom code directly on the HiveServer2 node, as well as on any of the data nodes, depending on the query that invokes it.
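The linked repository has the actual test, but the idea is simple enough to sketch here. A trivial UDF (the name and code below are mine, not the repository's) that reports which host evaluated it makes the behavior visible.

```java
import java.net.InetAddress;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Returns the hostname of the node that actually evaluated the UDF.
public class WhereAmIUDF extends UDF {
    public Text evaluate(Text ignored) {
        try {
            return new Text(InetAddress.getLocalHost().getHostName());
        } catch (Exception e) {
            return new Text("unknown: " + e.getMessage());
        }
    }
}
```

Registered with CREATE TEMPORARY FUNCTION, a constant-only call such as SELECT where_am_i('x') should come back with the HiveServer2 hostname, while the same call applied across a full table scan should return worker-node hostnames.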
Next Steps and Lessons Learned
Now that we had found the root cause and resolved the immediate issue, we needed to determine how to prevent something like this from happening in the future.
The first recommendation is to use Ranger to restrict UDF usage to only those UDFs that have been vetted prior to deployment. Documentation on how to do that is available here.
We also worked with the development team, advising that they shouldn't authenticate as a specific user inside a UDF, but should instead run as the hive user or turn the job into a standalone Spark job. This avoids the complications of handling authentication manually and lets the system work as it is intended.
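As a sketch of that last option (not the team's actual job), a standalone Spark job can be submitted with its own principal and keytab so that the platform handles the Kerberos login and ticket renewal. Every class name and path below is a placeholder.

```java
import org.apache.spark.launcher.SparkLauncher;

public class SubmitServiceAccountJob {
    public static void main(String[] args) throws Exception {
        // Equivalent to: spark-submit --master yarn --deploy-mode cluster
        //   --principal svc_account@EXAMPLE.COM --keytab /path/to/keytab ...
        Process spark = new SparkLauncher()
            .setMaster("yarn")
            .setDeployMode("cluster")
            .setAppResource("/path/to/service-job.jar")
            .setMainClass("com.example.ServiceAccountJob")
            .addSparkArg("--principal", "svc_account@EXAMPLE.COM")
            .addSparkArg("--keytab", "/path/to/svc_account.keytab")
            .launch();
        spark.waitFor();
    }
}
```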