Overview
Last week I mentioned SmartSense dashboards, including an error we were seeing with them. If you missed it, you can find that post here. That post includes a good background on what SmartSense is, so I recommend reading it if you haven't heard of this useful metrics tool.
One of the places where this tool really shines is monitoring the usage of HDFS resources, including the small files problem (previously discussed here). I'm going to walk you through some of the dashboards I monitor in SmartSense when dealing with HDFS small files, what to watch for, and how to use them.
All of these are either available by default in the HDFS dashboards section, or they are slight tweaks to existing dashboards that you can make yourself. All of the data is queried through Phoenix from metrics collected by the Ambari Metrics Service (AMS), which makes it really easy to construct SQL queries and try out new visualizations quickly.
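If you want to poke at the data directly, it's plain SQL through Phoenix's sqlline client. A minimal sketch, with the caveat that the connection details depend on your install and the table name below is a hypothetical stand-in, not the real SmartSense schema (use the !tables command to see what actually exists):

    -- Connect with Phoenix's sqlline client; the ZooKeeper quorum, port,
    -- and znode depend on your AMS/SmartSense install:
    --   sqlline.py <zookeeper-quorum>:<port>:<znode>

    -- List the available tables, then sample one (hypothetical name).
    !tables
    SELECT * FROM HDFS_FILE_STATS LIMIT 10;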
The visualizations below are from a real production cluster, with any identifying information removed. There are two dips in the data, caused by the issues I've discussed in the past; those dips can safely be ignored.
File Size Distribution Trend
The first dashboard we'll discuss is one of the first default graphs you'll see, and one of the most interesting. It shows how many files were in the cluster over time, stratified by file size. This not only lets you see how many of your files fall into each size range, but also how that distribution changes over time. If you only want one visualization to track during routine maintenance, this is the one I would recommend monitoring.
This dashboard is really useful both for getting your current status when you first start optimizing the number of files in your cluster and for tracking your progress as you go. Looking at the graph above, you can see that we have very few files anywhere near the cluster's block size. This is a huge issue, since it means we are wasting a lot of NameNode heap space on files that don't have much in them. You can also see the spike around 10/31, which is what kicked off our effort to address the small file issue. Since then we've been able to remove some tiny files, but we obviously still have a long way to go.
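For reference, a query of roughly this shape could drive the graph. This is only a sketch: HDFS_FILE_STATS and its columns (SCAN_DATE, FILE_SIZE) are hypothetical stand-ins for the actual schema, and it assumes one row per file per daily scan and a 128 MB block size:

    -- Count files per day in a handful of size buckets (sizes in bytes).
    SELECT scan_date, size_bucket, COUNT(*) AS file_count
    FROM (
      SELECT scan_date,
             CASE
               WHEN file_size < 1048576    THEN 'under 1 MB'
               WHEN file_size < 33554432   THEN '1-32 MB'
               WHEN file_size < 134217728  THEN '32-128 MB'
               ELSE '128 MB and up'
             END AS size_bucket
      FROM hdfs_file_stats
    ) AS buckets
    GROUP BY scan_date, size_bucket;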
We could do some variations on this visualization, grouping by file type or file path. Any of these could be useful, depending on what we need to understand about the situation.
Changes in Number of Files by User
The next step, once we know the overall status, is to find which users on our cluster have all of these files. That's where the next visualization comes in, which shows the number of files per user over time.
This is a custom visualization that I've created using Phoenix, performing a simple GROUP BY on user ID. I have restricted the output to the last two months to make the data easier to read.
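A minimal sketch of that query, again assuming the hypothetical HDFS_FILE_STATS snapshot table from above (USER_ID is likewise a stand-in column name):

    -- Files per user per day, restricted to roughly the last 2 months.
    -- Phoenix date arithmetic is in days, so CURRENT_DATE() - 60 works.
    SELECT scan_date, user_id, COUNT(*) AS file_count
    FROM hdfs_file_stats
    WHERE scan_date > CURRENT_DATE() - 60
    GROUP BY scan_date, user_id;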
From this, we can not only track which users are currently filling up the cluster, but also verify how they are doing in cleaning up their files. As we can see here, the pink user was able to get rid of a lot of files quickly and has stayed steady since, while the blue user is steadily removing files from the cluster.
On this visualization, you may want to change the time span you're looking at, or limit the number of users, in order to clean up the graph. Or you can adjust it and turn it into a table to get something like the next visualization.
Daily and Monthly Changes in File Count Per User
This one isn't so much a visualization as a table, but it is still really useful. When the issue is especially severe, it's common to have daily meetings with all of the teams to work on reducing the number of files in the cluster. In these instances, having a table like the one above is great, since it shows us the day-to-day change in files and lets us ensure that teams are making headway towards our goal. The monthly difference gives an idea of the overall trend; you could compare against a specific day instead if you like.
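Phoenix doesn't give you window functions to lean on, so one way to sketch the day-to-day change is a self-join of two daily snapshots (same hypothetical table as before; the monthly column works the same way with - 30):

    -- Day-over-day change in file count per user, as a self-join of
    -- today's and yesterday's snapshots. Table and columns hypothetical.
    SELECT t.user_id,
           t.file_count - y.file_count AS daily_change
    FROM (SELECT user_id, COUNT(*) AS file_count
          FROM hdfs_file_stats
          WHERE scan_date = CURRENT_DATE()
          GROUP BY user_id) AS t
    JOIN (SELECT user_id, COUNT(*) AS file_count
          FROM hdfs_file_stats
          WHERE scan_date = CURRENT_DATE() - 1
          GROUP BY user_id) AS y
    ON t.user_id = y.user_id;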
The one issue with this visualization is that you have to remember we are dealing with an active cluster. Applications are actively ingesting data at the same time we are trying to reduce the number of files, so you're liable to see day-to-day fluctuations in the daily change numbers. For this reason, it's not a good idea to focus too heavily on these numbers, but as long as they make some sense, they are reasonable to use as a monitoring tool.
Recent File Generating Applications
The last visualization I use regularly for small files is a bit different from the others. This table shows the top 10 file-generating applications run over the last 15 days. It allows you to see patterns in who is consistently generating files, and which jobs in particular are causing them.
This table is based on one that comes with SmartSense by default, which shows the jobs that generate the most files. Looking at individual jobs is often not very useful, so I joined the job data with the table that maps jobs to applications, then grouped by application ID and summed the HDFS file counts. This is by far the most complex of the queries I talk about here, but it still isn't too difficult for someone with a basic SQL background.
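A sketch of the shape of that query. Both table names (JOB_STATS and JOB_APP_MAPPING) and their columns are hypothetical stand-ins, since the real SmartSense table names will differ:

    -- Top 10 applications by HDFS files generated over the last 15 days.
    SELECT a.app_id,
           SUM(j.hdfs_files_created) AS total_files
    FROM job_stats AS j
    JOIN job_app_mapping AS a ON j.job_id = a.job_id
    WHERE j.finish_time > CURRENT_DATE() - 15
    GROUP BY a.app_id
    ORDER BY SUM(j.hdfs_files_created) DESC
    LIMIT 10;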
A variant on this that may be useful is to group by application name instead and show the average. This would give you the top 10 types of applications causing issues. Note that this depends on application names being meaningful enough that grouping them makes sense. That isn't always true, but when it is, this may give you even more insight into which of your applications are generating files!
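With the same hypothetical tables as above, the change is just the grouping column and the aggregate:

    -- Top 10 application names by average HDFS files generated per job.
    SELECT a.app_name,
           AVG(j.hdfs_files_created) AS avg_files
    FROM job_stats AS j
    JOIN job_app_mapping AS a ON j.job_id = a.job_id
    WHERE j.finish_time > CURRENT_DATE() - 15
    GROUP BY a.app_name
    ORDER BY AVG(j.hdfs_files_created) DESC
    LIMIT 10;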
Conclusion
This has been a very quick review of the dashboards I've personally found useful. There are plenty more in the HDFS dashboard section, as well as in the other dashboards. For example, any of the visualizations above can be done for total file size instead of file count, which would be useful if your HDFS capacity is running low.
The other dashboards include things such as information about wasted memory in your YARN jobs, recommendations on how to optimize your applications, and even a custom-made dashboard for determining how much to charge your client teams if your organization uses a charge-back model.
Hopefully, this has given you some ideas on how you can use SmartSense, and has gotten you interested in exploring it on your own cluster.