Introducing Lumos, Apteligent’s Platform to Catch ‘Em All
August 21st, 2016
Introducing Lumos, Apteligent’s Platform to Catch ‘Em All
In a previous article, Jeremiah explained the impact Niantic’s Pokémon Go had on our new platform. I’m going to give you a high level overview of that platform and talk about why we built it. In future articles, various members of the Apteligent Platform Team will pick up the narrative, presenting each subsystem, digging into the technology that makes it work. We’ll let you know about the problems we encountered, how we solved (or avoided) them, and explain the rough warty bits left over as well as our plans to address them.
Before I present our new platform, I’ll spend a bit of time describing the system it replaced. It’s easy to denigrate that which came before but I’ll try to be fair. So let’s turn the clocks back and look at the Apteligent platform of 2015.
Or, rather, the collection of independent systems that, when combined, provided the full Apteligent service. You see, we had grown really good at building the same thing again and again. Similar to a factory line, we would stamp out each feature, built from the ground up, tuned to solve a specific business problem. Crash reporting? Yep. Here it is, collection tier, processing, storage, API, UI. Done. Userflows? No problem. Collection tier, processing, storage, API, UI, done. Network insights? You get the idea. Each system’s data processing pipeline looks the same from the 10,000 foot architectural view but drop down a bit and each component is strikingly different. The language is different. The implementation is different, testing is different, configuration is different and so deployment and monitoring are also different. No way to run a railroad.
Another problem with this approach is consolidation of data. At Apteligent, we collect an astonishing amount of the stuff and our service needs to be up 24×7. Say what you like about the old system, it was damn good at it, coping with over 30,000 requests every second. Unfortunately, with all of its discrete systems, it was extremely difficult to cross reference that data in a consistent fashion, forcing us to build even more complicated extraction and consolidation infrastructure. We used the results to generate a lot of interesting reports on mobile trends and events, but it just took too much time and effort to build something new. We’d often speculate about the data we collected but putting those ideas to the test was prohibitive. So unless we were really really hot on it, or had trouble sleeping over the course of a few nights, or happened to have the house to ourselves over the weekend because the Significant Other was away visiting family, these ideas were all too often put in cold storage. We actually had a special freezer in the office upon which we could gaze sadly and think of these friends in stasis.
The last major problem with our previous platform was one very close to our VP of Finance’s heart: cost. The platform was growing far too expensive for the utility we were receiving from it and the future leveragability it provided.
Time, then, for a change. We have our top level requirements: we needed a unifying system that was capable of replicating the functionality we had at the time, allowed us to view all of our data and let the Data Science team have its wicked whimsical way with it, and it had to be cheaper than the existing requirement. A familiar refrain. Better, stronger, faster. And cheaper. Easy, right?
A Minor Diversion: Lambda
As part of the research into one of our standalone systems, Userflows, we’d done a spot of digging into Metamarket’s Druid. Though a nice system, we ultimately deemed it too costly at the time for the functionality we needed. But it did introduce us to the Lambda architectural pattern and that was a Good Thing and gave us Inspiration; our new Lumos data platform is based on that.
The Lambda architectural pattern is pretty simple in concept. It’s a bit like RAM vs disk access. RAM is fast but relatively expensive so it’s limited. Disk is cheap but slow. You store theinformation you need to access quickly in RAM and swap it out for disk-stored data as required. In a Lambda system, each event is stored in some sort of (hopefully) cheap storage. A batch computation system churns its way through the data and computes whatever results you need to serve. The batch system is cheap to run because it’s not executing all the time and works well on commodity hardware, a prime Map/Reduce application. The Lambda system also sends a limited amount of data to a realtime layer, usually some sort of stream processing system, which executes the same calculations as the batch layer. Lastly, there’s a serving or API tier which reads data from the batch or realtime layers as necessary.
It’s hard to envision without a picture, so have a look at the one below.
Back to the Action: Lumos
The new Apteligent platform, named Lumos for the light-bringing charm in Harry Potter, is comprised of 5 subsystems: collection, processing, storage, aggregation and serving.
- Data in the form of interesting events (such as crashes, handled exceptions, network performance data, whatever) are recorded and sent by our mobile SDK to the Lumos Collection tier The job of the collector is to very quickly accept events and place them in a queue.
- The Processing tier picks up the events from the queue.
- Once it’s done with them, the Processing tier hands all raw and processed events off to Storage and to the Realtime subsystem of the Aggregation tier.
- The Realtime system frequently aggregates data from a small window data set (for example, all data received in the past minute) and writes it to Serving.
- At fixed intervals, the Batch subsystem of Aggregation lifts processed events from Storage, performs the same aggregations as its Realtime counterpart, then writes the result into Serving.
- Lastly, Serving selects between realtime and batch data in response to API queries generated by consumers.
It’s As Much About The Rules You Break As The Rules You Follow
It’s worth pointing out that Lumos doesn’t exactly match the Lambda pattern. The Processing tier sitting between Collection and Storage was something we discussed at length. Typically Lambda stores all raw events first and processes them later, an approach with the dual advantages of (1) a simple collection and storage pipeline and (2) no temptation to throw anything away in the name of ‘efficiency’ (a cardinal sin committed by our previous system). This is where architectural purity met reality: in many cases raw data wasn’t useful to us and following the Lambda pattern in its ‘pure’ form would prohibit us from building a realtime layer. The best example of this is crash data submitted from iOS devices (and to a lesser extent Android devices).
As the mobile developers amongst you already know, crash information provided by iOS does not come complete with a usable stack traces. Before it’s of much use, the raw crash must be processed against the symbols generated as part of compilation. The process of combining these symbols, on iOS provided as dSYM files, with the crash stack trace is called symbolication. The Apteligent platform groups crashes together in part by looking at stack traces, so in order to group properly, we need to operate on post symbolicated data. Hence the Processing tier, an optional stage that, for crash, performs symbolication. To cover our bases, Processing sends both raw crash and symbolicated crashes to Storage. This way, if anything goes wrong with symbolication, or we upgrade the symbolication component, we can re-process crashes to get the benefits of the upgrade.
Let’s Talk About Crash
So far so good. We’ve talked about Lumos and how it works. But Lumos is a framework and does no real useful work by itself. To make the most of it, we have to build an application on top of Lumos, something that has value to our customers. Notionally, you can think of Lumos applications like this:
As indicated by the diagram, we chose to port the largest of our products, Crash. Peeling back the layers, the Crash architecture on Lumos is shown below, with green denoting Crash-specific components and white showing standard Lumos components and services.
You can see from the picture that we save a lot of time by focusing on the pieces that need to be custom and letting Lumos shoulder the burden of performance, scalability and monitoring. To build Crash on Lumos, we needed to create
1. The ErrorProcessor, which basically does the symbolication work we discussed in the previous section
2. The aggregation jobs, deployed to the aggregation tier’s realtime and batch functions
3. The specialized API, represented in the diagram as the EQL (Error Query Layer) component
Of all of these custom Crash components, I’m least happy with the EQL, architecturally speaking. As an individual component, it’s nicely written and optimized. However, the very fact that we had to write the EQL as a standalone component and not a plugin extension to a more generic Lumos API tier is disappointing. It was basically a time-to-market decision and one we will revisit at a later point.
The API published by the EQL had to be backward-compatible with the API published by its predecessor. This was necessary because we didn’t want to have to re-write the visualization layer provide by the Apteligent Portal. We also could not break consumers of our API by completely changing the contract on them. Again, this caused us angst. There are a significant number of optimizations and, in some cases, completely different API endpoints, we want to provide but, in the interests of both our time, and the time of our customers, we pushed these changes off to a later point.
Follow The Yellow Block Road
Now that I’ve laid it all out, let’s follow crash data through the system and talk about each component and what it does to that data. I’ll spare you the specifics; we’ll dig in at a later time. For the purposes of this explanation, let’s assume the mobile user is called Fred.
- Fred opens his favorite karaoke app, Caterwaul by OffTune, on his iPhone. He searches for his party piece song, Mustang Sally, written by Wilson Pickett but, as a child of the 90s, Fred prefers the Commitments version of it. Unfortunately, when he taps on the version of the song he wants, the app crashes. The Apteligent SDK detects the crash and sends the crash data to the platform.
- The Lumos collector receives the event. At this point the data is ‘hot’, meaning that if the event collector dies, the data is lost to the platform. Fortunately, the SDK has a retry mechanism and will try again if it does not receive an acknowledgement from the collector. In Fred’s case, the collector is hale and hearty and looks up the Apache Kafka topic associated with the ‘crash’ event type and pushes the data to it.
- Kafka receives the crash event. We have Kafka configured with a replication factor of3 (RF3), which means that the data is now ‘warm’. We can lose up to two nodes in the Kafka cluster without a data integrity incident.
- The KafkaETL (KETL) component reads the raw crash from the Kafka topic and pushes it to S3. Now the data is ‘cold’, meaning that we could lose every component in the system without suffering data loss. This would be bad, and nobody would get any sleep, but it’s a far cry from a disaster. We’ve preserved the data integrity of Fred’s crash in the event of failure. If necessary we can replay these raw events back through the system starting at the EventProcessor and everything will be fine.
- Whilst KETL is persisting Fred’s crash in its raw form, the ErrorProcessor picks up the same event from the same Kafka topic. It examines the crash and looks up dSYMs for Fred’s karaoke app and the iOS version Fred has installed on his iPhone. Things can get tricky here if we don’t have the necessary symbols. The ErrorProcessor is optimistic, insofar as it will attempt symbolication if it can find the symbols associated with the app, even if that results in partial symbolication due to missing library or system symbols. If there are no symbols present it will lump the crash into what we call the Sentinel. You can think of the Sentinel Crash as a bucket into which the ErrorProcessor will throw crashes it can’t symbolicate. The Sentinel allows the system to keep accurate counts despite not having enough information to fully process the crashes it receives. This is a bit different from how the previous system worked and we’ll talk about why we made this change in a later post. For now, returning to Fred’s crash, the ErrorProcessor is happy because it found all the information it needs, so it completes symbolication and pushes the crash onto another Kafka topic dedicated to holding symbolicated crashes.
- Kafka receives Fred’s symbolicated crash and persists it across its cluster in the same RF3 configuration we discussed in step 4. RF3 isn’t strictly necessary here. RF2 would probably be fine. But symbolication is expensive and we want to avoid doing it if possible, so instead we pay the additional replication cost.
- The Aggregation Tier’s realtime Spark Streaming cluster reads a microbatch of data from the symbolicated crash Kafka topic. Fred’s symbolicated crash is in this particular microbatch. The Spark jobs compute a number of statistics for the crashes scooped up in the batch. Fred’s crash is added to those rollups and Fred himself is added to the ‘unique user count’ HyperLogLog used to estimate the count of all of the users affected by the same crash Fred just experienced. The realtime tier pushes results to realtime-specific tables in Cassandra. These tables keep data for the past 24 hours before it ages out in favor of data computed by the batch layer.
- Much of the data in the symbolicated crash is not required to compute statistics. For example, the symbolicated stack trace isn’t used, nor are breadcrumbs. We do, however, need this data to show to the OffTune developers, so they can make sense of the crash, figure out how to fix it and push a new version of Caterwaul to the app store. The full crash data is stored in S3 by the Blob Writer. Unlike the KETL, which stores everything it sees in S3, the Blob Writer has some smarts about it, storing a fixed number of event instances indexed by app/version/crash combination. It is the Blob Writer that governs the number of crash instances the OffTune developers can see for each version of Caterwaul, as well as the number of distinct breadcrumb trails.
- Concurrently, we have a KETL instance reading symbolicated data and pushing it to S3 in bulk. Fred’s crash is intermixed with the crashes of all other apps at this point. This is efficient but a bit of a downer, because it means that we can’t give our customers access to raw crash data in S3. We’ve been talking about uncrossing the streams at some point in the future. Watch this space for details.
- Everything’s in S3. Yay! Safe as houses!
- The same aggregation jobs used by the Spark Streaming cluster in the realtime layer are employed by the batch layer to pick up data from S3 and aggregate over a much larger time window. Fred’s crash is scooped up here and included in the stats. This allows us to overcome another limitation of the previous system, where it would count each crash as occurring at the time it was received by its collector, not at the time it occurred on the device. Reporting data by the occurred time has an interesting effect on the graphs shown by the Apteligent Portal. If a crash comes in much later than when it occurred, numbers can go up for a given time slice. For example, the OffTune team might examine last week’s crash metrics for version 1.1 of Caterwaul on Wednesday and then see increased numbers for the same time period when they come back to look the next day.
- It is the responsibility of the EQL (Error Query Layer) to retrieve crash metrics and specifics for any given time period. It’s a publicly exposed API used by our customers, in this case OffTune, to pull data into their own systems. The API is used by our Adobe Analytics (FKA Omniture) and Splunk connectors, as well as a host of other integrations to pull aggregated data into whatever system tickles your fancy. It’s also used by our portal to provide visualizations of key data.
I hope you enjoyed this overview of the Apteligent Lumos platform. In subsequent posts we’ll go into more details of the technology choices and the inner workings of the components I’ve described above. We’ll talk about what has worked for us, what hasn’t worked, where we got things right (deliberately and inadvertently) and where we just flat out failed and had to fix. In the meantime, if you have questions, please ask below.
 Technically only iOS and Android NDK require symbolication. Android optionally uses ProGuard to shrink APK size and to obfuscate code. Our platform performs both symbolication and ProGuard mapping at the Processing stage. Internally, we refer to the different processes on iOS and Android collectively as ‘symbolication’.
 Actually, that’s my party piece. If you come to one of our Apteligent karaoke nights, I’ll sing it for you. We’ll also give you beer, which will in large part make up for the experience. Then onto the Vogon poetry!