A strong Big Data enterprise will often have multiple clusters which are interrelated. A common set is development, test (also known as UAT, pre-production, or staging), and production. As the names imply, systems should be developed in the development cluster, migrated to test for integration testing, and then migrated to production for the real work.
The catch is that the hosts often have different names in different environments. A common scheme is to use a single letter that denotes the environment, such as d, t, and p for the development, test, and production clusters respectively. This means that any hard-coded host names will break when code moves between environments.
The most obvious solution (and the most common one I've seen) is to change the code as you migrate it between environments. This is risky if you happen to break the code, and it defeats the whole purpose of having a test cluster, since different code will run in production anyway. In fact, I once had a development team that did this while migrating to production, and the migrated code ended up pointing to staging in one place and production in another, causing it to fail silently and write data to neither place. This was only discovered once data scientists raised the alarm that the data hadn't been updated in a week.
So what can we do to avoid this issue? In this article we'll discuss some of the ways we can write environment-agnostic code that can run in any environment within your enterprise. This often takes some forethought and work, but it will pay dividends in fewer migration failures and fewer bugs in general.
Bash Scripts
First, let's discuss the use of host names in bash scripts. Bash scripts are still heavily used to kick off Spark and Hive jobs from workflow management solutions such as Oozie. In these cases, the Hive connection string passed to Beeline is likely to change between environments. Additionally, if you are using services like Solr or Kafka, they will need host names as well, and those will be environment-specific. So how should we avoid this?
My favored solution to this is to use an environmentVars.sh bash script. This script should be written once by the administrators and placed into a common location in HDFS that is globally readable. I generally recommend something like /projects/common. This script should export variables which contain all of the host names for common uses, such as the Hive connection string and the Kafka bootstrap hosts.
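As an illustration, such a script might look something like the following. The variable names and hosts here are hypothetical; your administrators would fill in whatever services your clusters actually use, and each environment would get its own copy of the file.

    #!/bin/bash
    # environmentVars.sh -- maintained by the administrators for this cluster.
    # The variable names and host names below are hypothetical examples for a
    # development cluster; each environment has its own copy of this file.

    # Hive connection string for Beeline
    export HIVE_JDBC_URL="jdbc:hive2://d-hiveserver.example.com:10000/default;ssl=true"

    # Kafka bootstrap servers
    export KAFKA_BROKERS="d-kafka01.example.com:9092,d-kafka02.example.com:9092"

    # Solr ZooKeeper ensemble
    export SOLR_ZK_ENSEMBLE="d-zk01.example.com:2181,d-zk02.example.com:2181/solr"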
Once the administrators have created this script, developers can just pull the file down from HDFS and dot-source it, using a command like the following.
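Here the HDFS path follows the /projects/common convention mentioned above, and the variable name matches the hypothetical script sketched earlier.

    # Pull the shared file down from HDFS and dot-source it into this shell.
    hdfs dfs -get /projects/common/environmentVars.sh /tmp/environmentVars.sh
    . /tmp/environmentVars.sh

    # The exported variables are now available, for example:
    echo "Connecting to Hive at $HIVE_JDBC_URL"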
This will load all of the variables exported in the file into the local environment, so you can use them freely wherever necessary. The administrators should retain exclusive write access to the file and update it as necessary, being careful not to break backwards compatibility.
This approach has the added benefit of preventing issues when services move around. For example, if a Kafka broker is moved to another host, the administrators update the file, and voila, all systems automatically pick up the new location.
Hive Scripts
Hive scripts thankfully don't need to refer to host names, other than to run them through Beeline.
That said, I've seen a lot of clients where database names have the environment name in them, such as customer_dev or invoices_prd. This is a bad habit, and if you are in it, I recommend you break yourself of it soon. It ties your code to a single environment, adding more pain than necessary. It also seeps into places you may not expect, like a Spark job. Finally, it makes the name of the database or table less obvious to users who switch between environments.
If you do have to follow this convention because of a mandate from above, you can still write agnostic code; it will just be tougher. Hive variables can help you here: pass the environment-specific strings in through Beeline and use them in your queries. You then just have to be sure to include them in the Beeline call.
Using Hive variables in a Hive script looks something like the following.
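The table and column names here are made up; the important part is the ${MYDATABASE} reference, which Beeline substitutes at run time.

    -- Query a table in the environment-specific database.
    -- ${MYDATABASE} is filled in by Beeline via the --hivevar option.
    SELECT customer_id,
           SUM(amount) AS total_amount
    FROM ${MYDATABASE}.invoices
    GROUP BY customer_id;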
You would then use something like the following to pass the database name in, assuming the name for the current environment is stored in the shell variable $MYDATABASE.
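The connection string here reuses the $HIVE_JDBC_URL variable from the hypothetical environmentVars.sh above, and the script name is just an example.

    # Pass the environment-specific database name into the script via --hivevar.
    beeline -u "$HIVE_JDBC_URL" \
      --hivevar MYDATABASE="$MYDATABASE" \
      -f my_script.hql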
Spark Jobs
For Spark jobs, you can follow a similar approach to the one we used for Hive scripts and pass any environment-specific information in as parameters to your job. This is fairly straightforward, although it does require changes to the calling script whenever the job needs more information.
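As a sketch, the calling script might pass the values as application arguments; the job name and arguments here are hypothetical, and the job would read them from its argument list.

    # Pass environment-specific values as application arguments.
    # my_job.py is a hypothetical job; it would read these via sys.argv.
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      my_job.py "$MYDATABASE" "$KAFKA_BROKERS"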
If you are using an interpreted API such as Python or R, however, you can do something a bit different that removes the need for the calling script to know what information the Spark job needs. In this technique, the administrators create a second file called something like environmentVars.py or environmentVars.r. This file functions like the bash script, this time declaring global variables that can easily be imported. It lives in the same place as the bash script and is also globally readable. The calling script just needs to ensure the file is included in the spark-submit call; the Spark job then imports the file and uses the variables as necessary.
The one thing to keep in mind here is that you now need to update two files when host information changes. If your company mostly uses Python and R, this may be worth it for the development overhead it removes. If there are only a few interpreted Spark jobs, however, it may cause more hard-to-find errors than it is worth, and passing the values in directly will be better.
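A minimal sketch of such a file, with hypothetical variable names and hosts, might be:

    # environmentVars.py -- maintained by the administrators; hosts are hypothetical.
    # Ship it with the job, for example:
    #   spark-submit --py-files hdfs:///projects/common/environmentVars.py my_job.py
    # and then inside the job:
    #   import environmentVars
    #   ... environmentVars.KAFKA_BROKERS ...

    HIVE_JDBC_URL = "jdbc:hive2://d-hiveserver.example.com:10000/default;ssl=true"
    KAFKA_BROKERS = "d-kafka01.example.com:9092,d-kafka02.example.com:9092"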
Oozie Workflows
Sadly, Oozie currently doesn't have a great story for these sorts of issues. If you are calling a bash script or running a Spark job, the solutions above still apply. But some Oozie actions, such as the Hive actions, need host names of their own.
The best practice I've found is to keep the environment-specific variables in the properties file for the top-level entity that you run. Oftentimes this will be the bundle, but sometimes just the coordinator. Only environment-specific variables should be kept in this file, with the rest set in the bundle or coordinator XML itself. This isolates the environment-specific variables, making it easier to figure out what should migrate and what shouldn't.
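As an illustration, such a properties file might contain only entries like these; nameNode, jobTracker, and the application path are standard Oozie properties, while the hosts, project path, and hiveJdbcUrl name are hypothetical.

    # job.properties -- only environment-specific values live here.
    nameNode=hdfs://d-namenode.example.com:8020
    jobTracker=d-resourcemanager.example.com:8032
    hiveJdbcUrl=jdbc:hive2://d-hiveserver.example.com:10000/default;ssl=true
    oozie.bundle.application.path=${nameNode}/projects/myproject/oozie/bundle.xml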
Once you've achieved this isolation, you can recreate the properties file in the new environment as necessary instead of migrating it. Everything else can be migrated safely with no fear of issues. After the first migration, the properties file only needs to be recreated when it changes, which is hopefully not very often.
Conclusion
Hopefully the above tips got you thinking about how to make your code environment agnostic and save time and resources in the long run. There are a few more general tips that are important enough to include here.
When you refer to an HDFS file, such as when you create a Hive table or read a file into Spark, it is best not to specify the name node if you can avoid it. The file path would then look something like hdfs:///data/project/someData instead of hdfs://namenode/data/project/someData. This may not be possible if you are referring to a name node other than the one you point to by default, in which case you may need to follow an approach similar to the one presented above. The same goes for cloud storage, where you likely have a different bucket per environment.
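For example, a Hive table definition can leave the name node out of its location entirely; the table definition itself is hypothetical, but the path matches the example above.

    -- The LOCATION omits the name node, so the same DDL works in every environment.
    CREATE EXTERNAL TABLE some_data (
      id    BIGINT,
      value STRING
    )
    STORED AS PARQUET
    LOCATION 'hdfs:///data/project/someData';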
In some cases, you can save yourself some time by being consistent with naming. For example, if you do have to use different database or table names between environments, and you are consistent about using the same string for a given environment (prod vs. prd, and so on), then you can pass in just that environment string, and it will work across any resources you need to pull from.
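As a sketch, with a single hypothetical ENV variable passed in through Beeline, consistently suffixed names resolve everywhere; the databases and tables here are made up.

    -- With consistent suffixes (dev, tst, prd), one variable covers every resource.
    -- ${ENV} would be passed in with something like --hivevar ENV=prd.
    SELECT o.customer_id, p.amount
    FROM customer_${ENV}.orders o
    JOIN invoices_${ENV}.payments p
      ON o.invoice_id = p.invoice_id;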
Finally, for other services not mentioned here, the general approach is the same: isolate the environment-specific values and push them as far up the stack as you can. This ensures that the values are specified in as few places as possible and are contained in one location that can easily be excluded during migrations.
Do you have any ideas for ways you can make your code environment agnostic that we didn't cover here? Add them to the comments!