
Splunk vs ELK

If you are in IT Operations in any role, you have probably come across either Splunk or ELK, or both. These are two heavyweights in the field of Operational Data Analytics. In this blog post, I’m going to share with you what I feel about these two excellent products based on my years of experience with them.

The problem Splunk and ELK are trying to solve: Log Management

While there are fancier terms such as ‘Operational Data Intelligence’, ‘Operational Big Data Analytics’ and ‘Log Data Analytics Platform’, the problem both Splunk and ELK are trying to solve is log management. So, what’s the challenge with log management?

Logs, logs, logs and more logs

The single most important piece of troubleshooting data for any software program is the log generated by the program. If you have ever worked with vendor support for any software product, you have inevitably been asked to provide – you guessed it, log files. Without the log files, they really can’t see what’s going on.

Logs not only contain information about how the software program runs; they may contain data that is valuable to the business as well. Yep, that’s right. For instance, you can retrieve a wealth of data from your web server access logs to find out things like the geographical dispersion of your customer base, the most visited pages on your website, and so on.

If you are running only a couple of servers with a few applications on them, accessing and managing your logs is not a problem. But in an enterprise with hundreds or even thousands of servers and applications, this becomes an issue. Specifically,

  1. There are thousands of log files.
  2. The size of these log files runs into gigabytes or even terabytes.
  3. The data in these log files may not be readily readable or searchable (unstructured data).


Both Splunk and ELK attempt to solve the problem of managing ever-growing log data. In essence, they supply a scalable way to collect and index log files, and they provide a search interface to interact with the data. In addition, they provide a way to secure the data being collected and enable users to create visualizations such as reports, dashboards and even alerts.

Now that you know the problem Splunk and ELK are attempting to solve, let’s compare them and see how each one achieves it. I’m going to compare them in four areas:

  1. Technology
  2. Cost
  3. Features
  4. Learning Curve for the operations team

Got it? I can’t wait to share. Let’s dive in.


Technology

Witnessing C++ vs Java has never been more exciting

While Splunk is a single, coherent closed-source product, ELK is made up of three open-source products: Elasticsearch, Logstash and Kibana.

Both Splunk and ELK store data in Indexes. Indexes are the flat files that contain searchable log events.

Both Splunk and ELK employ an agent to collect log file data from the target servers. In Splunk, this agent is the Splunk Universal Forwarder. In ELK, it is Logstash (and, in recent years, Beats). There are other means of getting data into the indexes, but the majority of use cases will use the agents.


While Splunk uses proprietary technology (primarily developed in C++) for its indexing, Elasticsearch is based on Apache Lucene, an open-source search engine library written fully in Java.

On the search interface side, Splunk employs a Search Head, a Splunk instance with specific functions for searching. ELK uses Kibana, an open-source data visualization platform. When it comes to creating visualizations, in my opinion, Splunk makes Kibana look plain. (Note: it is possible to connect Grafana to ELK to visualize data; some believe Grafana visualizations are richer than Kibana’s.) Recent versions of Kibana also include Timelion, a time series data visualizer.

For querying, Splunk uses its proprietary SPL (Splunk Processing Language, whose syntax resembles SQL statements strung together with Unix pipes), while ELK uses Query DSL, with an underlying JSON-formatted syntax.
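To make the contrast concrete, here is a sketch of the same hypothetical search in both languages: count web server errors by host. The index, sourcetype and field names are made up for illustration, and the Query DSL request is written the way you would type it into the Kibana Dev Tools console.

    # Splunk SPL: filter events, then pipe them into a stats command
    index=web sourcetype=access_combined status=500
    | stats count by host

    # Elasticsearch Query DSL: the same idea as a JSON request
    GET /web-*/_search
    {
      "size": 0,
      "query": { "term": { "status": 500 } },
      "aggs": {
        "errors_by_host": { "terms": { "field": "host" } }
      }
    }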

Let me summarize the technical info in the table below.

    Area               Splunk                               ELK
    -----------------  -----------------------------------  --------------------------------
    Product model      Single closed-source product         Three open-source products
    Indexing engine    Proprietary (primarily C++)          Apache Lucene (Java)
    Collection agent   Universal Forwarder                  Logstash / Beats
    Search interface   Search Head                          Kibana
    Query language     SPL (SQL-like with Unix pipes)       Query DSL (JSON)

In the end

Both Splunk and ELK are fundamentally very sound technologies. Though one can argue one way or the other, the longevity of these two products in the marketplace proves that each is superior in its own way. However, Splunk differs in one crucial respect: schema on read.

With schema on read, minimal processing is required before indexing. In fact, you can throw anything at Splunk as long as it can determine the Host, the Source (the file the data comes from) and the Source Type (a meta field, set manually or determined automatically, that helps Splunk identify the type of the log file). The fields are generally determined ONLY at search time.
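As a quick illustration of schema on read, here is a hypothetical SPL search that extracts a field with a regular expression at search time; nothing about a ‘username’ field had to be defined before the data was indexed:

    index=app sourcetype=myapp_logs
    | rex field=_raw "user=(?<username>\w+)"
    | stats count by username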

With ELK, however, you must provide the field mapping ahead of time (before indexing). One can certainly argue that this is not necessarily bad, but I’m going to leave that for the community to decide.
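For comparison, here is a minimal sketch of defining an Elasticsearch field mapping ahead of indexing. The index and field names are hypothetical, and the exact request shape varies across Elasticsearch versions (older versions nest the properties under a mapping type):

    PUT /web-logs
    {
      "mappings": {
        "properties": {
          "timestamp": { "type": "date" },
          "status":    { "type": "integer" },
          "host":      { "type": "keyword" }
        }
      }
    }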

Cost

Is Open-Source really free?

Cost of the software: ELK is free. Splunk is not.

Splunk’s license fee is based on the daily log volume being indexed. For example, you may buy a 1TB license, which lets you ingest up to 1TB per day. There is no cost for keeping historic data; it is only the daily volume that counts (the license meter resets at midnight every day). Further, the cost is NOT based on the number of users or the number of CPU cores. You can get either a term license, which you pay for per year, or a perpetual license, which is a one-time fee plus an annual support fee (if any).

I’m unable to give you a dollar figure, as it varies greatly based on the geographic location of your business and, obviously, on the data volume (and the sales team you are working with :-)). But in general, compared to other commercial products in the market (Sumo Logic, Loggly, Sematext, etc.), Splunk *may* be on the expensive side. (Again, there are too many variables to give you a black-and-white answer.)

ELK is open source. You pay nothing for using the software.

But, and this is a big but, the cost of ownership is not just the cost of software. Here are other costs to consider.

  1. Cost of Infrastructure. Both Splunk and ELK would require similar hardware infrastructure.
  2. Cost of implementing the solution. This is a big one. For example, when you purchase Splunk, you might get some consulting hours that you can use to implement your solution. With ELK, you are on your own.
  3. Cost of ongoing maintenance: This can also be a big one. Once again, you might get some support hours from Splunk, but with ELK, you are on your own.
  4. Cost of add-ons and plugins: Both Splunk and ELK have plugin/add-on based solutions that extend the functionality. Some are free and some are not. For example, you will have to pay for Shield (ELK’s security plugin) and ITSI (Splunk IT Service Intelligence).

In the end

Yes, open source is free. But is it free, as in free? The biggest problem you will face, as an evangelist of ELK in your organization, is coming up with a dollar amount for the total cost. As for Splunk, you have to be able to convince your organization of the cost. At least in this case, the cost is predictable.

Features

Looking for something? There is an app for it.

Both Splunk and ELK have a myriad of features. When I say feature, it can be any of the following:

  1. Support for a certain type of data input. For example, does it allow data input via HTTP, or from a script? So, earlier when I said both Splunk and ELK employ an agent to collect data, I lied. Both products support several other means of getting data in.
  2. A data visualization functionality. For example, does it allow creating custom dashboards, reports, etc.? How feature-rich are they?
  3. Integration with other products/frameworks. For example, can it send/receive data to/from APM products such as New Relic, Dynatrace or AppDynamics? Can it send/receive data from Hadoop? Both products integrate well with many major platforms.
  4. Security features: Does it support role-based access, two-way SSL or Active Directory integration? With Splunk, security is available out of the box. But with ELK, you pay for Shield (or X-Pack in recent versions).
  5. Data manipulation: How easy is it to modify the data being ingested? Can I mask sensitive information readily? Splunk provides powerful regular-expression-based filters to mask or remove data. The same can be achieved with Logstash in the ELK world (see the masking sketch after this list).
  6. Extensibility: Can we easily extend the product by writing our own solutions?
  7. Metrics store: Indexing text (log files) is one thing; indexing metrics (numerical data) is another. The performance of indexing and search is astronomically higher on a time series metrics index. Splunk introduced this in version 7.
  8. Agent management: How are you going to manage hundreds or thousands of Beats or Splunk Universal Forwarders? While Ansible or Chef can be used with both products, Splunk has the advantage of letting you manage universal forwarders using its Deployment Server (a Splunk instance with a specific function).
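On the data manipulation point above (item 5), here is a sketch of masking a Social Security Number on both sides. The sourcetype, field and patterns are hypothetical:

    # Splunk: props.conf on an indexer or heavy forwarder
    [myapp_logs]
    SEDCMD-mask_ssn = s/\d{3}-\d{2}-\d{4}/XXX-XX-XXXX/g

    # ELK: an equivalent Logstash filter using mutate/gsub
    filter {
      mutate {
        gsub => [ "message", "\d{3}-\d{2}-\d{4}", "XXX-XX-XXXX" ]
      }
    }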

In the end

Since both ELK and Splunk have strong user communities and good extensibility, there is no shortage of plugins and add-ons. (In the Splunk world, there is the notion of apps.)

splunkbase.splunk.com (https://splunkbase.splunk.com/)


Elasticsearch Plugins (https://www.elastic.co/guide/en/elasticsearch/plugins/current/index.html)


Learning Curve for the operations team

From 0 to 60mph in 3 seconds. Really?

Both Splunk and ELK have massive product documentation. Perhaps too much documentation if you want to get started rapidly. The learning curve for both products is steep.

For both products, a solid understanding of regular expressions (regex), scripting (shell/Python and the like) and TCP/IP is required.

For performing searches, you must learn SPL (Splunk Processing Language) for Splunk and Query DSL for Elasticsearch. SPL is like Unix pipes plus SQL, and it has tons of commands you can use. With Query DSL, queries are formatted as JSON. In both products, the search language can easily overwhelm a new user. Because of the sheer number of features SPL provides, Splunk can be much more intimidating than ELK. (In fact, there are 143 search commands in SPL in Splunk Enterprise 7.0.1.)

Creating visualizations also requires some learning. Here again, Splunk provides more features and might look more intimidating than Kibana to a new user. Note that you can also use Grafana to connect to ELK to visualize data.

Perhaps the biggest hurdle you will face with Splunk is the server-side configuration and administration. The entire product is configured using a bunch of .conf files. You will need intimate knowledge of the specifications of these configuration files in order to implement and support Splunk. While ELK does require some reading for server-side setup, it’s not nearly as much as Splunk.
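To give you a feel for it, here is a minimal sketch of two of those files on a Universal Forwarder. The paths, sourcetype, index and server names are made up:

    # inputs.conf: which files to monitor and how to tag them
    [monitor:///var/log/myapp/app.log]
    sourcetype = myapp_logs
    index = myapp

    # outputs.conf: where the forwarder ships the data
    [tcpout:primary_indexers]
    server = splunk-idx1.example.com:9997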

In the end

Splunk does have a steeper learning curve compared to ELK. But whether that is a showstopper for you is something you have to decide. You will have to invest in a few resources with solid Splunk experience if you want to implement and support the solution correctly.

So, there you have it: Splunk vs ELK at a super high level. I haven’t gone deep into the technical aspects, for brevity, but there is plenty of documentation for both Splunk and ELK online. Just get lots of coffee before you begin 🙂

Let me know what you think.

Happy Monitoring !


So we all know that the Java heap is a crucial resource, the lack of which will kill your application. Naturally, you will want to monitor heap usage. But surprisingly, it is not very straightforward to measure the heap usage of your JVM unless you have a modern APM (Application Performance Management) tool implemented. To make things worse, in the Windows world, the memory you see in Windows Task Manager (a solid tool, by the way) is NOT the same as the JVM heap size.

For example, let’s say you have set the maximum Java heap at 3GB. But it is quite possible that the memory shown by Task Manager is higher (much higher at times), say 3.5GB or 4GB. You can pull your hair out trying to figure out where that extra memory utilization came from, or you can read the rest of this article and put an end to the mystery.

Generally speaking, here is why the memory shown by Task Manager is more than the heap: Task Manager shows the entire memory footprint of the JVM, NOT just the JVM’s Java heap. Note that the JVM is just another process as far as Windows is concerned.
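If you want to see that full footprint broken down (heap, thread stacks, metaspace, code cache and so on), one option on a reasonably recent Oracle/OpenJDK JVM is Native Memory Tracking; a sketch, with myapp.jar as a stand-in for your application:

    # Start the JVM with Native Memory Tracking enabled (adds a small overhead)
    java -XX:NativeMemoryTracking=summary -Xmx3g -jar myapp.jar

    # Then ask the running JVM for its full memory breakdown
    jcmd <pid> VM.native_memory summary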

Read More

What you didn’t know about java.lang.OutOfMemory Error!

Few errors are deadlier than java.lang.OutOfMemoryError. When this happens, your application will run into a very unpredictable state, often hanging and not processing new requests, coughing up ugly stack traces into users’ browsers, etc. The most popular (and, in the short run, effective) fix for an OutOfMemory error is simply restarting the application or the application server.

While running out of Java heap is the most common OutOfMemory error, there are indeed several types of OutOfMemory errors that can occur. In this post, I will show you these various types of OutOfMemory errors and what they mean.

java.lang.OutOfMemoryError: Java heap space

The most popular one. This is when the JVM goes belly up and is unable to allocate memory in the heap. Your application might appear hung and extremely slow to respond to user requests.

For example, let’s say your Java application has a maximum of 2GB of heap (set through the -Xmx flag). When the entire 2GB is used up and GC is unable to reclaim any memory, the next memory allocation request for an object will fail with java.lang.OutOfMemoryError.
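Here is a minimal, self-contained sketch that forces this error by holding on to every allocation so GC can never reclaim anything. Run it with a small heap (e.g. java -Xmx64m FillHeap) to see it fail quickly:

    import java.util.ArrayList;
    import java.util.List;

    public class FillHeap {
        public static void main(String[] args) {
            List<byte[]> hog = new ArrayList<>();
            while (true) {
                // Each 1 MB array stays reachable via the list, so GC cannot
                // reclaim it; eventually an allocation fails with
                // java.lang.OutOfMemoryError: Java heap space
                hog.add(new byte[1024 * 1024]);
            }
        }
    }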

What causes java.lang.OutOfMemoryError: Java heap space?

Read More

One important change in Memory Management in Java 9

Java 9 is supposed to be released in summer 2017. The biggest changes appear to be around modularization, which enables you to create, build, deploy and reuse modules of code. There is also the new jshell command-line tool that lets you test snippets of code quickly. From an application support standpoint, I see one major change in memory management, and that’s what I want to discuss today.

GC (Garbage Collection) is the bread and butter of memory management in Java. GC is responsible for cleaning up dead objects from memory and reclaiming that space. GC executes its cleanup using predefined collectors that each use certain algorithms. As I mentioned in my previous post, there are four collectors that can be used.

  • Serial Collector
  • Parallel Collector
  • CMS (Concurrent Mark Sweep) Collector
  • G1 Collector

Until Java 1.8, on server-class machines (2 CPUs and 2GB of RAM minimum; humor me if your server is leaner than this), the default collector is the Parallel collector.

The Parallel collector is all about throughput. It does not care about longer GC pause times (it will stop the application threads while performing a GC). If you cannot tolerate long GC pause times (upwards of 1 second), you could choose the CMS collector.

G1GC was introduced in Java 1.7 to address both increased throughput and decreased GC pause times overall.

Here is the major change with this edition of Java. G1GC is the default collector in Java 9.
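You normally don’t have to do anything to get the default, but you can pick a collector explicitly with a JVM flag. A quick sketch using Java 8 flag names, with myapp.jar as a placeholder:

    java -XX:+UseSerialGC        -jar myapp.jar   # Serial collector
    java -XX:+UseParallelGC      -jar myapp.jar   # Parallel (the Java 8 server-class default)
    java -XX:+UseConcMarkSweepGC -jar myapp.jar   # CMS
    java -XX:+UseG1GC            -jar myapp.jar   # G1 (the Java 9 default)

    # Verify which collector your JVM picks by default
    java -XX:+PrintCommandLineFlags -version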

Read More

In this post, I’m going to explain how the heap is organized using generations and how garbage collection works behind the scenes to free up memory. Java memory management has evolved a lot over the past few Java releases. Understanding the mechanics underneath will help you better tune it (if required) to suit your needs.

When your Java application runs, it creates objects, which take up memory space. As long as an object is being used (i.e., referred to by the application somewhere), it is going to occupy memory. When the object is no longer used (for example, when you cleanly close a DB connection), the space it occupies can be reclaimed by garbage collection.

What are generations?

The Java heap is typically divided into two major pools: an area where short-lived objects live, and an area where long-lived objects live. The young generation (aka nursery, aka Eden) is the place for short-lived objects, and the tenured generation (aka old generation) is the place for long-lived objects.
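You can watch these generations at work by turning on GC logging; the log shows young and old generation sizes before and after each collection. A sketch of both flag styles, with myapp.jar as a placeholder:

    # Java 8 and earlier: classic GC logging flags
    java -XX:+PrintGCDetails -Xloggc:gc.log -jar myapp.jar

    # Java 9 and later: unified logging replaces the old flags
    java -Xlog:gc*:file=gc.log -jar myapp.jar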


Why Generations?

Read More

How to monitor Heap usage of a Java Application?

The heap is the memory space where Java application objects are stored. Unfortunately, the size of the heap is limited (by the -Xmx Java command-line option). When all of the heap is used up, you are in deep trouble. So it helps to understand how to measure Java heap usage, and that’s exactly what I’m going to discuss with you in this post.

Surprisingly, measuring heap usage is not that straightforward in the Java world. I believe this is because there are numerous ways to measure the Java heap; Oracle seems to introduce new tools with every release (and retire the old ones). For this reason, I recommend that everyone invest in a commercial-grade APM (Application Performance Management) tool. With an APM tool, you simply log in to the APM interface (typically a web application) to view the JVM heap usage (along with tons of other useful metrics).

Using the command jcmd to measure heap usage

Java comes with a command-line tool named jcmd. It should be available in most flavors of Java (Oracle, IBM, etc.), and it is what I recommend for measuring your heap usage. One of the advantages of jcmd is the class histogram, which not only shows the heap usage but also breaks it down by class. This gives you an instant view of what is causing the heap to fill up.

Important note: jcmd does have some impact on the JVM. But I assure you it is worth the price, as the details shown by jcmd are very valuable.

Important note: the jcmd command must be run as the same user (or effective user) as the one running the Java application.

Let us see jcmd in action.
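As a preview, here is the basic shape of it (replace <pid> with your application’s process id):

    # List the running JVMs to find the process id
    jcmd -l

    # Class histogram: heap usage broken down by class, largest consumers first
    jcmd <pid> GC.class_histogram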

Read More

By default, AppDynamics captures tons of useful metrics from your Java application: average response times of your applications, JDBC response times, throughput, heap and garbage collection metrics, and so on. In addition, AppDynamics automatically captures Transaction Snapshots (which provide deep diagnostics) periodically and during slow response times. The snapshots reveal the hotspots down to the method level.

But at times, you may want to monitor a particular method for performance metrics. For example, you may want to know how often a method named ‘cancelOrder’ is called and how long it takes to process. For requirements like this, AppDynamics provides a neat way of instrumenting your Java methods: Information Points.


To configure a new Information Point, navigate to Analyze -> Information Points. Click on the New button at the top.

Read More

One of the major uses of an APM such as AppDynamics is the ability to collect application data at the method level. For example, let’s say you have a method named processOrder in your code that accepts OrderID and Username as parameters. What if you want to capture the Username as part of the performance metrics? This can be extremely valuable, for instance, to identify which user could be submitting orders that are very slow to process. In AppDynamics, you can achieve this by using Data Collectors. In this blog post, let me show you how it is done.

  1. Log on to AppDynamics Controller UI
  2. Navigate to Configuration -> Instrumentation and click Data Collector tab
  3. You have two types of Data Collectors to choose from
    • Method Data Collector – Captures method arguments and return values.
    • HTTP Data Collector – Captures URLs, Headers and Cookies
  4. Click on Add under the Method Data Collector
  5. Configure the Method Data Collector as follows
    • Name: Provide a meaningful name for this Data Collector
    • Select Apply to new Business Transactions. If you don’t select this, you will have to select the Business Transactions manually.
    • Provide the method signature (identifying the method) by its fully qualified class name and method name (see the hypothetical example after this list).
    • If the method is overloaded, i.e., the same method name with different arguments, you need to select the Is this Method Overloaded check box and choose the method signature you want to monitor.
    • Optionally, specify a match condition to choose the method.
    • Under Specify the data to collect from this Method Invocation, enter a display name (this is the name that will show up in Transaction Snapshots under User Data) and the data to collect. You can collect method arguments and/or return values.
  6. Note that you can also configure an HTTP Data Collector if you wish to capture HTTP metadata such as URLs, cookies and header values.
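To make step 5 concrete, suppose the method you want to capture lives in the following hypothetical Java class. In the Data Collector you would enter com.example.orders.OrderService as the fully qualified class name, processOrder as the method name, and collect the second argument (username):

    package com.example.orders;

    // Hypothetical class matching the processOrder example above
    public class OrderService {
        // A Method Data Collector can capture the value of 'username'
        // on each invocation and attach it to Transaction Snapshots
        public String processOrder(String orderId, String username) {
            // ... order processing logic ...
            return "OK";
        }
    }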

Read More

Glad you are here. It’s time to take a deeper look at all the various monitoring metrics that AppDynamics provides. We have so far done the following:

  1. Installed AppDynamics Controller
  2. Installed a sample application (Node.js, REST and MySQL) and installed the AppDynamics agent
  3. Ensured the application shows up in the controller UI.

Note that we have not customized any monitoring elements yet, i.e., whatever we are going to see is out of the box, which is one of the coolest things about AppDynamics. Let’s start.

How to view the Flow Map of the Application?

The application flow map provides an excellent view of the data flow of the application. AppDynamics automatically tracks all the inbound and outbound calls of the instrumented application and attempts to draw this flow map. It is like an interactive Visio diagram of the application infrastructure architecture that you did not have to create (AppDynamics created it for you). Let’s take a look at the flow map of the sample application we have instrumented.

Read More

So you have installed the AppDynamics Controller, instrumented a simple Java application and even viewed it from the AppDynamics Controller UI. You have come a long way. Excellent work! In this blog post, I’ll take you a step further by showing you how to view critical application metrics in the UI. To see various monitoring metrics, let us use a more sophisticated application (one with a backend database).

Fortunately, a sample application is available on GitHub specifically for this purpose. Kudos to the developers who made it available to all of us. The application is based on Docker (no surprise there), so you need to have Docker installed on the server you are planning to use as the application host. If you don’t want to go this route, simply instrument one of your own applications (preferably in a dev environment).

Here are the steps to install the docker based sample application.

  1. Clone the git repo:

    git clone https://github.com/Appdynamics/SampleApp.git


Read More