DEV Community: Robert Beekman

Launching Public Status Pages for Uptime Monitoring on AppSignal

Robert Beekman — Wed, 15 Sep 2021 11:56:00 +0000

Since the launch of uptime monitoring, we have received a lot of positive feedback. There were also a couple of much-requested additional features that we hope to address in this huge update.

Configurable Regions for Uptime Monitoring

We started uptime monitoring from a few regions: Asia, North America, South America and Europe. But not every app has to be monitored from all regions. It doesn't really matter if performance is poor from South America for a Europe-centric app, for example.

That's why we're introducing an option to select the regions where you'd like to monitor your app.

Public Uptime Status Pages

The most requested feature was to somehow expose uptime monitoring metrics as public status pages.

Starting today, you can create public status pages for your uptime monitoring metrics.

You're able to select multiple uptime monitors across your organization to show your uptime on this status page.

These uptime monitors will be shown on the public status page, where customers can click to see specific details about each monitor.

You can also post updates to this status page to let your customers know about any issues. The state of the page is determined by these updates, so you have full control over the status.

Future Uptime Monitoring Features

Right now, status pages will run on a customizable subdomain of our appsignal-status.com domain (e.g. <yourcompany>.appsignal-status.com). We're planning to release custom domains in the near future.

Do you have any other ideas for public status pages? Don't hesitate to let us know.

Uptime Monitoring Sprinkled with Stroopwafels

If you haven't had the chance to test AppSignal and uptime monitoring, here's what you need to know:

Uptime monitoring is included alongside all of our features.
We have a free trial option that doesn't require a credit card.
AppSignal supports Node.js, Ruby, and Elixir projects.
We're free for open source & for good projects.
We ship stroopwafels to our trial users on request.

Need we say more? 🍪

How to Read Performance Metrics in AppSignal

Robert Beekman — Tue, 12 Nov 2019 13:46:18 +0000

In this post, you'll learn which metrics to keep an eye on to improve your application performance, how AppSignal works, and how to interpret the data it generates. Grab a stroopwafel, make yourself comfy, and let's start!

The Importance of Performance

Most developers understand that it's critical to catch and track exceptions in their application. Some of them use tools such as AppSignal to capture errors and monitor performance, and others set up their own monitoring systems. Either way, the goal is that none of your users end up looking at an error page in the middle of their workflow. For example, a user is on the checkout page with a basket filled to the brim with all sorts of products. However, instead of the checkout page, there's an error message on their screen. Chances are that the user will leave your website and shop somewhere else. While error reporting is generally accepted as a requirement for production sites, performance is often ignored.

In the past few years, several studies have shown that website visitors aren't exactly always the most patient of people. For example, DoubleClick by Google found 53% of mobile visits were abandoned if a page took longer than 3 seconds to load.

These days it's just as important to know how your application is performing in production as it is to know if there are any errors. While errors occur and are handled in a well-defined way (an error either happens or it doesn't), the question “what is good performance” is a lot more challenging to answer.

The answers differ from page to page and also depend on the type of users you have. For example, people are more likely to accept a slower response time for pages with a lot of dynamic content customized to their preferences.

Let's dive into what makes up a typical web response and how each part plays a role in the performance.

A Typical Request

A typical request starts with the browser making a request for a specific page. In the case of Rails/Phoenix, your webserver will accept this request and route it to a controller that handles the request. The controller usually contains one or more database queries that have to be executed in order to retrieve the required data. This data is fed into a templating system that will convert the data to HTML (or JSON).

Several actions are happening during a request that can influence the total response time. Your database is most likely to be the main influencer. Complicated queries on data that isn't indexed result in slow response times.

The templating system can also influence the response time, as complicated loops in the template can increase the duration of a request.

Instrumentation

In 2010 Rails 3 was released, and this included a new feature called ActiveSupport::Notifications. With this system, it was possible to instrument certain parts of your code and track how long it took to execute this code and collect the data for further processing.

This feature allows users to track how long a database query took, or how long it took Rails to process a request. With this information, it's possible to pinpoint performance issues in certain parts of your application such as:

Database queries,
View render,
Framework overhead,
Controller code, etc.
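
To make the idea of instrumentation concrete, here is a minimal Rust sketch of what timing a named block of code boils down to. It is not how ActiveSupport::Notifications or AppSignal is implemented; the `Recorder` and `Event` types and the `instrument` method are hypothetical names used only for illustration.

```rust
use std::time::{Duration, Instant};

/// One timed event; the name is a free-form label chosen by the caller.
#[derive(Debug)]
pub struct Event {
    pub name: String,
    pub duration: Duration,
}

/// Hypothetical recorder that collects the events emitted during one request.
#[derive(Debug, Default)]
pub struct Recorder {
    pub events: Vec<Event>,
}

impl Recorder {
    /// Runs `work`, measures how long it took, and stores the timing under `name`.
    pub fn instrument<T>(&mut self, name: &str, work: impl FnOnce() -> T) -> T {
        let started = Instant::now();
        let result = work();
        self.events.push(Event {
            name: name.to_string(),
            duration: started.elapsed(),
        });
        result
    }
}
```

Wrapping a database call in `recorder.instrument("database_query", || ...)` returns the closure's result unchanged and leaves one timed event in `recorder.events`, which is the raw material for the breakdown listed above.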

How AppSignal Hooks Into the Framework(s)

AppSignal listens to the ActiveSupport::Notifications events and stores them locally until the request has finished. It then processes the request data (e.g. removing sensitive data such as passwords) and sends this data to our Agent.

The Agent is started when your application starts. It is responsible for collecting and aggregating the data for your application, and it periodically transmits this data to our servers. By aggregating the data in the Agent, we make sure to send only relevant data to our servers, which limits bandwidth usage.

The request data is compiled into several entities such as samples and metrics.

Samples are the result of all data that your application generated during a request. These are all the events a user had to wait for before the server could respond with view data, such as HTML or JSON, and are specific to a single request.

Metrics are aggregated data, such as the mean duration of all requests for a certain controller, or even for your entire application. You cannot identify a single request in this data, but it shows overall performance through the mean and the 90th and 95th percentiles. We use this data to generate graphs to track performance over time. We've written about how to interpret these aggregated metrics in an earlier post, Don't be mean: Statistical means and percentiles 101, and I highly recommend reading it to understand what these metrics mean and how they portray your application's performance.
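
As a rough illustration of why the mean and the percentiles tell different stories, here is a small Rust sketch that computes both from a list of request durations. The post does not say how AppSignal calculates percentiles, so the nearest-rank method used here is an assumption, and the function names are made up for this example.

```rust
/// Arithmetic mean of the durations (in milliseconds); None for an empty slice.
pub fn mean(durations_ms: &[f64]) -> Option<f64> {
    if durations_ms.is_empty() {
        return None;
    }
    Some(durations_ms.iter().sum::<f64>() / durations_ms.len() as f64)
}

/// Nearest-rank percentile: the smallest value such that at least `p` percent
/// of the values are less than or equal to it. (Assumed method, see above.)
pub fn percentile(durations_ms: &[f64], p: f64) -> Option<f64> {
    if durations_ms.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = durations_ms.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // rank = ceil(p * n / 100), clamped so that p = 0 maps to the first value.
    let rank = (p * sorted.len() as f64 / 100.0).ceil().max(1.0) as usize;
    Some(sorted[rank - 1])
}
```

With nine requests of 10ms and one of 1000ms, the mean is 109ms even though nine out of ten users saw 10ms; the 90th percentile is 10ms and the 95th is 1000ms. Looking at the numbers together is what reveals the single slow outlier.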

How to Debug a Performance Issue in the AppSignal UI

Now that we have data coming in from your application, it's time to figure out how this data helps you spot and debug performance issues in your application.

Incidents

The sample data collected by the Agent will result in what we call “Performance Incidents”. These incidents revolve around a single controller action (or a background job) in your application and show the performance of this controller action.

A nice way to make your application faster is to go through the Incident list and sort the table by “impact”. Impact is a metric that tells you how much your application as a whole would improve if you were to make that action faster.

In general, it's best to optimize actions that are requested often and have a high duration. You can spend days optimizing an action that took 2 seconds to respond, but if only a single person was impacted by it, it's probably better to spend that time optimizing the 500ms action that was requested 10,000 times in the last hour. Of course, this all depends on what action it was.
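
The trade-off between request count and duration can be sketched in a few lines of Rust. The exact formula behind AppSignal's impact metric is not given in this post, so the sketch assumes a simple definition: an action's share of the total time spent across all actions. The `ActionStats` type and `rank_by_impact` function are hypothetical.

```rust
/// Aggregated numbers for one controller action (hypothetical shape).
#[derive(Debug, Clone)]
pub struct ActionStats {
    pub name: String,
    pub request_count: u64,
    pub mean_duration_ms: f64,
}

impl ActionStats {
    /// Total time users spent waiting on this action, in milliseconds.
    pub fn total_time_ms(&self) -> f64 {
        self.request_count as f64 * self.mean_duration_ms
    }
}

/// Returns (action name, share of total time in percent), highest share first.
/// Assumption: impact is modelled as count * mean duration relative to the total.
pub fn rank_by_impact(actions: &[ActionStats]) -> Vec<(String, f64)> {
    let total: f64 = actions.iter().map(ActionStats::total_time_ms).sum();
    let mut ranked: Vec<(String, f64)> = actions
        .iter()
        .map(|action| {
            let share = if total > 0.0 {
                action.total_time_ms() / total * 100.0
            } else {
                0.0
            };
            (action.name.clone(), share)
        })
        .collect();
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    ranked
}
```

Under this assumption, a 2 second action requested once accounts for 2,000ms of waiting, while a 500ms action requested 10,000 times accounts for 5,000,000ms, so the faster action ends up at the top of the list.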

Clicking on an incident brings you to the sample page.

Samples

These samples contain request data such as Parameters, Session data, Environment data, and most importantly the “Event tree”.

The event tree is the result of all the ActiveSupport::Notifications data we collected and will show exactly when a database query was executed, the (anonymized) query data, and the query duration.

You can nest ActiveSupport::Notifications calls and the event tree will indent when a nested call is detected. We also detect if a certain event was executed more than once. The best-known reason for this is an N+1 query. If you see this in your event tree, you can find out what it means and how to fix it in our guide: ActiveRecord performance: the N+1 queries anti-pattern.
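
How AppSignal detects repeated events is not described here, but the basic idea can be sketched in Rust: count how often the same (anonymized) event body shows up within one sample. The function name, the threshold parameter, and the assumption that anonymized queries compare equal as plain strings are all illustrative.

```rust
use std::collections::HashMap;

/// Counts how often each event body occurs in a sample and returns the ones
/// that occur at least `threshold` times, most frequent first.
pub fn repeated_events(events: &[&str], threshold: usize) -> Vec<(String, usize)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for &event in events {
        *counts.entry(event).or_insert(0) += 1;
    }
    let mut repeated: Vec<(String, usize)> = counts
        .into_iter()
        .filter(|&(_, count)| count >= threshold)
        .map(|(event, count)| (event.to_string(), count))
        .collect();
    // Most frequent first; ties are broken alphabetically for a stable order.
    repeated.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    repeated
}
```

A sample with one posts query followed by the same comments query for every post is the classic N+1 shape: the comments query is reported with its repeat count, while the single posts query is not.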

On the incident page, you can set up alerts that will send a notification to your email (or Slack, HipChat, PagerDuty and many others) when we collect a sample that crosses the set threshold (by default this is 200ms). You can also comment on the incident or send it to an issue tracker such as GitHub, GitLab, or Jira.

Graphs

While samples give an excellent deep-dive look into what's happening for a certain controller action (or a background job), this does not provide a good overview of how this controller action (or even your entire application) performs over time. An application might still feel snappy, but the response time could have increased a lot with the release of new features or because of the high application usage.

The “graphs” screen in the performance section uses the collected metrics and will give an overview of the performance of your application. In combination with deploy tracking, you can easily spot performance degradations in your application. Setting the custom date range to 30 or even 90 days will show if your performance is still consistent with what it was a while ago, and this way you can prevent what is called “The boiling frog” of your application.

You can also find “event metric” graphs on this page, but we'll get into those later.

Actions

The “Actions” section of the application is a different view of the collected metrics for each controller action.

The “Actions” table will let you see all the controller actions we've detected in your application, and you can see a summarised throughput and response time for the selected timeframe. This page helps you answer questions such as “how many requests did this action handle in the last 30 days”.

Clicking on an action takes you to the action page with a list of errors and graphs for the error rate, throughput, and response times. This is the same data as in the “graphs” page, but specific to this action. You can see if your latest deploy had any impact on this action's performance.

Event Metrics

Besides metrics for each action, we also collect metrics for each event that was emitted through ActiveSupport::Notifications. This means we collect the throughput and performance for individual database queries and template render calls.

You can find all the collected events on the “Event metrics” page. The ActiveSupport::Notifications naming convention is that of event.group (for example sql.active_record), where the group can be active_record or mongodb for database queries and action_view for template renders.

Other groups on this page can be net_http, active_job or 3rd party instrumentation such as sidekiq or even your own instrumentation calls.

Clicking on a group will take you to a page showing all individual events of that group (e.g. individual database queries or view render calls). In turn, clicking on an individual event will take you to a page that shows metrics for this event and in what controller actions (or background jobs) this event was seen.

As with Incidents, you can also sort these tables by impact to get a nice to-do list for query optimization.

Other Metrics in AppSignal

There are many more metrics that we collect and features we expose in AppSignal to make your application faster, but the ones mentioned above are a good starting point in making your application faster to provide your users with a pleasant browsing experience.

If you are comfortable with these basics, you can dive deeper into optimisation by implementing caching, tracking host metrics, or setting up alerts with custom metrics and anomaly detection.

You’ve Passed Performance Metrics 101!

Today we went through instrumentation, incidents, we dove deeper into these with samples, and we talked about visualising with graphs. We also went through the basics of events and actions. That concludes performance metrics 101. Yay!

If you have any questions or comments, don't hesitate to contact us.

Kafka and Ruby, a Sidekiq lovestory

Robert Beekman — Thu, 25 Apr 2019 09:12:59 +0000

In today's article, we’ll cover performance from a different angle: The choices we made in our stack.

Usually we write about changes and features we release for AppSignal that are public on our changelog and here on the blog. But besides these public-facing features, we also spend a lot of time on making sure AppSignal can cope with the growth of traffic.

Because we are developers ourselves working on problems like this, we think we do a pretty good job at helping you as well with our APM (shameless plug 🤪). But today we use that experience to discuss our own stack. We will go over one of the bigger changes we made in the past few years to handle tens of billions of requests per month. We'll cover why we made that choice and the pros and cons of our approach.

From a standard Rails setup to more custom parts

AppSignal started out as a pretty standard Rails setup. We used a Rails app that collected data through an API endpoint which created Sidekiq jobs to process in the background.

After a while we replaced the Rails API with a Rack middleware to gain a bit of speed and later this was replaced with a Go web server that pushed Sidekiq compatible jobs to Redis.

App state and increments/updates

While this setup worked well for a long time, we began to run into issues where the databases couldn’t keep up with the number of queries run against them. At this point we were processing tens of billions of requests already. The main reason for this was that each Sidekiq process needed to get the entire app's state from the database in order to increment the correct counters and update the right documents.

We could alleviate this somewhat with local caching of data, but because of the round-robin nature of our setup it still meant that each server needed to have a full cache of all data, because we couldn’t be sure which server the payload would end up on. We realised that with the data growth we were experiencing this setup would become impossible in the future.

Enter Kafka

In search of a better way to handle the data, we settled on using Kafka as the data processing pipeline. Instead of aggregating metrics in the database, we now aggregate the metrics in Kafka processors. Our goal is that our Kafka pipeline never queries the database until the aggregated data has to be flushed. This drives the number of queries per payload down from up to ten reads and writes to just one write at the end of the pipeline.
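
The "aggregate first, write once" idea can be sketched as follows. This is not our actual processor code; the `Aggregator` type, its keys, and its method names are hypothetical, and the sketch only shows counters being combined in memory and handed over in one batch.

```rust
use std::collections::HashMap;

/// In-memory aggregator: counts are kept per (app, metric) key until flushed.
#[derive(Debug, Default)]
pub struct Aggregator {
    counters: HashMap<(String, String), i64>,
}

impl Aggregator {
    /// Adds one payload's count without touching any database.
    pub fn record(&mut self, app_id: &str, metric: &str, count: i64) {
        *self
            .counters
            .entry((app_id.to_string(), metric.to_string()))
            .or_insert(0) += count;
    }

    /// Drains the aggregated state, sorted by key. The caller would turn the
    /// returned rows into a single write at the end of the pipeline.
    pub fn flush(&mut self) -> Vec<((String, String), i64)> {
        let mut rows: Vec<_> = self.counters.drain().collect();
        rows.sort();
        rows
    }
}
```

However many payloads arrive for the same app and metric between two flushes, they collapse into one row, which is where the drop in database queries comes from.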

We specify a key for each Kafka message, and Kafka guarantees that messages with the same key end up on the same partition, which is consumed by the same server. We use the app's ID as the message key. This means that instead of having a cache for all customers on every server, each server only has to cache data for the apps it receives from Kafka.
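
Why the same key always lands on the same partition is easiest to see in code: the partition is a pure function of the key. The sketch below uses Rust's standard `DefaultHasher` purely for illustration; real Kafka clients use their own hash functions, and the `partition_for` name is made up.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Maps a message key to a partition number in `0..partitions`.
/// Illustrative only: this is not the hash function Kafka clients use.
pub fn partition_for(key: &str, partitions: u64) -> u64 {
    assert!(partitions > 0, "a topic needs at least one partition");
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish() % partitions
}
```

Because nothing but the key and the partition count goes into the calculation, every message for a given app ID is routed to the same partition, and therefore to the same consumer and its local cache.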

Kafka is a great system and we’ve migrated over in the past two years. Right now almost all processing is done in Rust through Kafka, but there are still things that are easier done in Ruby, such as sending Notifications and other database-heavy tasks. This meant that we needed some way to get data from Kafka to our Rails stack.

Connecting Kafka and Ruby/Rails

When we began this transition there were a couple of Kafka Ruby gems, but none worked with the latest (at the time 0.10.x) release of Kafka and most were unmaintained.

We looked at writing our own gem (which we eventually did). We will write more about that in a different article. But having a nice driver is only part of the requirements. We also needed a system to consume the data and execute the tasks in Ruby and spawn new workers when old ones crash.

Eventually we came up with a different solution. Our Kafka stack is built in Rust and we wrote a small binary that consumes a sidekiq_out topic and creates Sidekiq compatible jobs in Redis. This way we could deploy this binary on our worker machines and it would feed new jobs into Sidekiq just as you would do within Rails itself.

The binary has a few options, such as a limit on the amount of data in Redis: when the limit is reached, it stops consuming the Kafka topic until the backlog drops below the threshold. This way all the data from Kafka won’t end up in Redis' memory on the workers if there is a backlog.
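
The limiter behaviour can be sketched with two in-memory stand-ins: a `VecDeque` playing the role of the Kafka topic and a `Vec` playing the role of the Redis job list. The function name and the way the threshold is expressed (a maximum number of queued jobs) are assumptions for this example; the real binary talks to Kafka and Redis.

```rust
use std::collections::VecDeque;

/// Moves messages from `topic` (stand-in for the Kafka topic) into `queue`
/// (stand-in for the Redis list) until the queue holds `max_queued` jobs.
/// Returns how many messages were forwarded; the rest stay in `topic`.
pub fn forward_until_full(
    topic: &mut VecDeque<String>,
    queue: &mut Vec<String>,
    max_queued: usize,
) -> usize {
    let mut forwarded = 0;
    while queue.len() < max_queued {
        match topic.pop_front() {
            Some(message) => {
                queue.push(message);
                forwarded += 1;
            }
            // Nothing left to consume.
            None => break,
        }
    }
    forwarded
}
```

When the workers fall behind, the queue stays at its limit and the backlog simply remains in the topic, where storage is cheap, instead of filling up memory on the worker machines.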

From Ruby’s point of view, there is no difference at all between jobs generated in Rails and those that come from Kafka. It allows us to prototype new workers that get data from Kafka and process it in Rails–to send notifications and update the database–without having to know anything about Kafka.

It made the migration to Kafka easier as we could switch over to Kafka and back without having to deploy new Ruby code. It also made testing super easy as you could easily generate jobs in the test suite to be consumed by Ruby without having to set up an entire Kafka stack locally.

We use Protobuf to define all our (internal) messages. This way we can be pretty sure that if the test passes, the worker will correctly process jobs from Kafka.

In the end this solution saved us a lot of time and energy and made life a lot simpler for our Ruby team.

Pros and cons

As with everything there are a few pros and cons for this setup:

Pros:

No changes in Ruby required, API compatible
Easy to deploy and revert
Easy to switch between Kafka and Ruby
Redis isn’t overloaded by messages when using the limiter, saves memory on the server, keeping the messages in Kafka instead.
Horizontal scaling leads to smaller caches on each server, because of the keyed messages.

Cons:

Each Sidekiq thread still needs access to a cache (e.g. Memcache) of all data for the apps from the partitions the server consumes.
Separate process running on the server
The Rust processor commits the message offset when the message is flushed to Redis. This guarantees that the message is in Redis, but not that it is processed by Ruby. In case of a server crash, there is a chance that some messages that were in Redis but not yet processed are lost.

Sidekiq and Kafka

Using Sidekiq helped us tremendously while migrating our processing pipeline to Kafka. We've now almost completely moved away from Sidekiq and handle everything via our Kafka driver directly, but that's for another article.

This is it for today. We hope you enjoyed this perspective on performance and scaling, and our experience scaling AppSignal. Follow us to keep an eye on when the next episode about Kafka is published.

Extending Existing Functionality In Rust With Traits In Rust

Robert Beekman — Tue, 23 Apr 2019 18:19:38 +0000

At AppSignal we use Protobuf to pass messages through Kafka. We picked this because we were already using Protobuf in other places in our stack and it works great for our use-case.

One of the benefits of Protobuf is that it generates Rust code based on the protocol definition, which we can extend through traits to add additional features.

A common thing we have to do in our processing pipeline is to merge two messages into one, e.g. merge two (count) metrics.

In this case we want to merge two Counter messages that look like this:

message Counter {
  int64 count = 1;
}

We can generate a Rust implementation of this protocol with protoc and extend this protocol using a trait.

A trait can be used to define functionality a type must provide. You can also implement default methods for a trait that can be overridden.

In this case we declare a merge function on our CounterExt trait.

extern crate protobuf;

pub mod protocol;

use protocol::Counter;

pub trait CounterExt {
    fn merge(&mut self, to_merge: &Counter);
}

In the code above we use the protobuf crate and expose the Rust code generated by protoc as a public module. We also use the Counter message we defined in the protocol. Then we define a new trait for the counter, called CounterExt.

The trait declares one function, merge, that accepts another counter to merge.

Next up we need to implement this function for the generated Counter type.


impl CounterExt for Counter {
    fn merge(&mut self, to_merge: &Counter) {
        let our_count = self.get_count();
        self.set_count(our_count + to_merge.get_count());
    }
}

In this method we take the given counter and add its value to self.

Now that we have created this trait and implemented it for Counter, we can use it to merge two counters directly on the Protobuf generated code.

This means we can operate directly on deserialised Protobuf messages without having to convert them to structs or create a new message to contain the computed value.

use std::collections::hash_map::Entry;
use std::collections::HashMap;

// Use the protocol Counter and the trait.
use protocol::protocol::Counter;
use protocol::CounterExt;

// `cache` holds the merged Counter per key; `message` is the Counter
// deserialised from the Kafka payload.
fn process_message(cache: &mut HashMap<String, Counter>, key: String, message: Counter) {
    match cache.entry(key) {
        // We have an entry, merge the counter
        Entry::Occupied(mut cache_entry) => {
            cache_entry.get_mut().merge(&message);
        },
        // No entry, insert it
        Entry::Vacant(cache_entry) => {
            cache_entry.insert(message);
        }
    }
}

The code above gets called for each Kafka message and updates a local cache with the merged value of the received message if it exists.

And it inserts the message into the cache if it doesn't already exist.

By extending our Protobuf messages with traits we save ourselves a lot of hassle in the message processing function.
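
If you want to try the pattern without setting up protoc, the following self-contained sketch compiles with the standard library alone. The hand-written `Counter` struct is a stand-in for the generated message and only mimics its `get_count`/`set_count` accessors; everything else follows the snippets above, with the cache passed in explicitly.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Hand-written stand-in for the protoc-generated Counter message.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Counter {
    count: i64,
}

impl Counter {
    pub fn get_count(&self) -> i64 {
        self.count
    }

    pub fn set_count(&mut self, count: i64) {
        self.count = count;
    }
}

/// Extension trait that adds merging to a type we did not write ourselves.
pub trait CounterExt {
    fn merge(&mut self, to_merge: &Counter);
}

impl CounterExt for Counter {
    fn merge(&mut self, to_merge: &Counter) {
        let our_count = self.get_count();
        self.set_count(our_count + to_merge.get_count());
    }
}

/// Merges `message` into the cached counter for `key`, or inserts it.
pub fn process_message(cache: &mut HashMap<String, Counter>, key: String, message: Counter) {
    match cache.entry(key) {
        Entry::Occupied(mut cache_entry) => cache_entry.get_mut().merge(&message),
        Entry::Vacant(cache_entry) => {
            cache_entry.insert(message);
        }
    }
}
```

Processing a counter of 2 and then a counter of 3 under the same key leaves a single cached counter with a count of 5.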

Besides merging we implement a few other methods on our Protobuf messages that handle merging and computation of quantiles/percentiles/mean values.

Like this article or have any comments? Contact me on Twitter or by email at hello@matsimitsu.com

Resize images from s3 with AWS Lambda and Rust

Robert Beekman — Sat, 09 Mar 2019 10:00:00 +0000

The very first iteration of my site didn’t have resized images and always showed the full 2200-pixel-wide images. This was great on my local (desktop) machine, but when I tried visiting the site in a hotel in Cambodia, the site took ages to load.

Resizing the images locally worked fine, but took a lot of time and uploading an image in five different sizes on slow Wi-Fi took ages, if it worked at all.

I then switched to using an image resize proxy that took images from disk and re-sized them on the fly, caching the result in Nginx. This worked okay, but there was a tradeoff between server specs and monthly cost. Low specs meant that on an uncached page it took minutes before all images were resized, while high specs meant high monthly cost for a server that was idle 99% of the time.

The solution to not running a server was switching to imgix. It’s a great service that resizes images for you and does so with good quality and speed, but there’s a minimum fee of $10.00 a month, whether you use the service or not, and the costs go up pretty quickly as you add more and more photos. This is in addition to the cost of the S3 storage that imgix uses as the source for its proxy.

This led me to the latest solution: use S3 to store the images (imgix also requires S3 as a source, so the images were already there) and AWS Lambda to resize the images on upload.

This means I only have to upload an image once and Lambda will take care of all the resized variants. I found a few (JavaScript) solutions, and aws-lambda-image looked the easiest to use. This ran for a few months, before I decided to roll my own solution for a few reasons.

aws-lambda-image does a lot of magic and uses Claudia to manage the Lambda settings. While it works great, I don’t really like tools that require high-level access to AWS APIs and configure a lot for you automatically. I have no idea what’s happening after running the commands.

Another risk for me is that it runs on Node, which will require an upgrade at some point. Such an upgrade has a high risk of breaking the function, and these things always happen at the most inconvenient times.

What I wanted was a single binary that just keeps working and requires no upkeep, configured by myself so I know what’s happening, and ideally more efficient than the Node.js solution. It’s also a great excuse to play some more with Rust and the just-released Rust AWS Lambda Runtime.

The goal

I wanted something similar to the JavaScript solution I had been using. It should listen to events emitted when a file is uploaded to S3 and resize the image to several widths (360px, 720px, 1200px and 2200px).

Before we start, you can follow along with the complete source on GitHub.

A binary project

Let's start by making a new Rust project. It should be a binary project, and we need to make a few tweaks to Cargo.toml to make sure Lambda can run the binary.

[package]
name = "lambda-image-resize-rust"
version = "0.1.0"
authors = ["Robert Beekman <robert@matsimitsu.nl>"]

[dependencies]
lambda_runtime = "0.1"

[[bin]]
name = "bootstrap"
path = "src/main.rs"

The way AWS Lambda works is that it starts the app/binary for you and then you have to call a certain endpoint from the app to receive new jobs to process. The lambda_runtime crate abstracts this process away, and all you have to do is implement an event handler and pass it to the lambda! macro.

Cargo (heh) culting from the example app, we start a logger and run the lambda for the AWS Runtime.

use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    simple_logger::init_with_level(log::Level::Info)?;

    lambda!(handle_event);

    Ok(())
}

The handle_event function will be called with the JSON result from the endpoint the runtime has called for us. Let’s convert this into a nice struct with Serde, by using the aws_lambda_events crate.

Handle S3 events

This event contains one or more “records” that represent the S3 uploads it has received.

fn handle_event(event: Value, ctx: lambda::Context) -> Result<(), HandlerError> {
    let config = Config::new();

    let s3_event: S3Event =
        serde_json::from_value(event).map_err(|e| ctx.new_error(e.to_string().as_str()))?;

    for record in s3_event.records {
        handle_record(&config, record);
    }
    Ok(())
}

For each upload we have to get the file from S3, convert the file to one or more image variations and upload those back to S3 again. There are a couple of crates that implement the S3 API; I went with rust-s3 as it looked simple and small.

AWS Lambda sets a couple of default ENV vars, among them AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. The rust-s3 crate can detect those with the Credentials::default() function.

In my case I want to store the images in the same bucket as the source, just in a different location, so I can use the information from the S3 event to determine the region and bucket.

fn handle_record(config: &Config, record: S3EventRecord) {
    let credentials = Credentials::default();
    let region: Region = record
        .aws_region
        .expect("Could not get region from record")
        .parse()
        .expect("Could not parse region from record");
    let bucket = Bucket::new(
        &record
            .s3
            .bucket
            .name
            .expect("Could not get bucket name from record"),
        region,
        credentials,
    );
    let source = record
        .s3
        .object
        .key
        .expect("Could not get key from object record");
}

Recursion

Now that we have all the required configuration to get and store images on S3, we have to do a sanity check first. We listen to an S3 event for uploaded images, but this function also uploads images to S3. This means that if you make a mistake in the configuration, S3 could send out an event for each file you put back into the bucket.

This can mean that you’ll process your own (resized) images again, and since we generate more than one variant for each uploaded image, every pass multiplies the number of files. Combined with the power of Lambda and its concurrency, it can mean that you’ll quickly generate thousands of new Lambda tasks, forcing you to hit the dreaded “panic button” in the Lambda UI, before you rack up an enormous AWS bill. (This may or may not come from my own experience ;)).

After resizing the image, we’ll append -<size> to the filename (e.g. foo.jpg becomes foo-360.jpg). To prevent this Lambda recursion we check the uploaded filename to see if it was already resized.

    /* Make sure we don't process files twice */
    for size in &config.sizes {
        let to_match = format!("-{}.jpg", size);
        if source.ends_with(&to_match) {
            warn!(
                "Source: '{}' ends with: '{}'. Skipping.",
                &source,
                &to_match
            );
            return;
        }
    }
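
Pulled out into a standalone function, the same check is easy to test in isolation. The sketch below assumes the configured sizes are whole pixel widths (`u32`) and only handles the `.jpg` suffix, just like the snippet above; the `already_resized` name is made up for this example.

```rust
/// Returns true when `source` already carries one of our size suffixes,
/// e.g. "foo-360.jpg" for a configured width of 360.
pub fn already_resized(source: &str, sizes: &[u32]) -> bool {
    sizes
        .iter()
        .any(|size| source.ends_with(&format!("-{}.jpg", size)))
}
```

With sizes 360 and 720, `trips/foo-360.jpg` is skipped while `trips/foo.jpg` and `trips/foo-100.jpg` are still processed, so a file named after an unconfigured width is treated as a fresh upload.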

Get images from and upload to S3

Now that we know for sure we have the right image, let’s fetch it from S3 and load it from memory into the image crate.

    let (data, _) = bucket
        .get(&source)
        .expect(&format!("Could not get object: {}", &source));

    let img = image::load_from_memory(&data)
        .ok()
        .expect("Opening image failed");

Resize the images

With the image in memory we can resize it. Depending on the memory you give a Lambda function, you get one or more CPU cores at your disposal. To get the most out of our billed execution time, I opted to use the great rayon-rs crate to execute the resizes in parallel. All you have to do to process the sizes in parallel is replace .iter() with .par_iter(), awesome!

    let _: Vec<_> = config
        .sizes
        .par_iter()
        .map(|size| {
            let buffer = resize_image(&img, &size).expect("Could not resize image");

            let mut target = source.clone();
            for (rep_key, rep_val) in &config.replacements {
                target = target.replace(rep_key, rep_val);
            }
            target = target.replace(".jpg", &format!("-{}.jpg", size));
            let (_, code) = bucket
                .put(&target, &buffer, "image/jpeg")
                .expect(&format!("Could not upload object to: {}", &target));
            info!("Uploaded: {} with: {}", &target, &code);
        })
        .collect();

Another thing we do is loop through &config.replacements. This is another feature to combat the recursion problem: it allows us to replace certain parts of the (input) path of a file.

We can set a REPLACEMENTS env var with key/value strings, such as "original:resized".

With an input path of /original/trips/asia2018/img_01.jpg this will be converted to /resized/trips/asia2018/img_01.jpg. Combined with the input filter on the AWS Lambda configure page you can make sure converted images are never processed twice.
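
The Config struct itself isn't shown in this post, so here is a hedged sketch of how the SIZES and REPLACEMENTS values could be parsed and applied. The comma-separated format for multiple entries is an assumption (only the "original:resized" pair format comes from the text above), sizes are treated as whole pixel widths, and all function names are hypothetical.

```rust
/// Parses a comma-separated list of widths, e.g. "360,720,1200,2200".
/// Entries that are not valid numbers are skipped. (Assumed format.)
pub fn parse_sizes(raw: &str) -> Vec<u32> {
    raw.split(',')
        .filter_map(|size| size.trim().parse::<u32>().ok())
        .collect()
}

/// Parses "from:to" pairs separated by commas, e.g. "original:resized".
/// Pairs without a colon are skipped. (Assumed format for multiple pairs.)
pub fn parse_replacements(raw: &str) -> Vec<(String, String)> {
    raw.split(',')
        .filter_map(|pair| {
            let (from, to) = pair.trim().split_once(':')?;
            Some((from.to_string(), to.to_string()))
        })
        .collect()
}

/// Builds the S3 key for a resized variant: apply the path replacements,
/// then append "-<size>" before the ".jpg" extension.
pub fn target_key(source: &str, replacements: &[(String, String)], size: u32) -> String {
    let mut target = source.to_string();
    for (from, to) in replacements {
        target = target.replace(from.as_str(), to);
    }
    target.replace(".jpg", &format!("-{}.jpg", size))
}
```

For the path above, the 360px variant of /original/trips/asia2018/img_01.jpg ends up at /resized/trips/asia2018/img_01-360.jpg.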

Finally we need to implement the actual resize function called in the code above.

It takes the image and a new width, calculates the ratio, and derives the new height. We then call the image crate's resize function and use the ImageOutputFormat::JPEG(90) enum variant to set the JPEG quality to 90 (from the default 75).

fn resize_image(img: &image::DynamicImage, new_w: &f32) -> Result<Vec<u8>, ImageError> {
    let mut result: Vec<u8> = Vec::new();

    let old_w = img.width() as f32;
    let old_h = img.height() as f32;
    let ratio = new_w / old_w;
    let new_h = (old_h * ratio).floor();

    let scaled = img.resize(*new_w as u32, new_h as u32, image::FilterType::Lanczos3);
    scaled.write_to(&mut result, ImageOutputFormat::JPEG(90))?;

    Ok(result)
}

You can find the complete project on GitHub.

Compiling for AWS Lambda

With a working binary, we now need to (cross)compile it for the right environment/distribution. Luckily, a person named softprops created a Docker image that has all the tools we need to compile this binary for use with the Lambda image.

    docker run --rm \
        -v ${PWD}:/code \
        -v ${HOME}/.cargo/registry:/root/.cargo/registry \
        -v ${HOME}/.cargo/git:/root/.cargo/git \
        softprops/lambda-rust

This will generate a bootstrap.zip file in target/lambda/release. You can also get the bootstrap.zip from the releases page.

Configuring Lambda

With a freshly compiled binary, we're nearly there. We need to do two things: configure an IAM role that allows the Lambda function to write logs and access the S3 bucket, and configure the Lambda function itself.

Let’s start with the IAM role. We’ll have to add two policies: one that allows the function to log and one that allows access to S3. It should look something like:

Bonus points if you lock the S3 role down a bit more, by not allowing it to remove items.

With a role configured, we can configure the Lambda function. We have to set the SIZES and REPLACEMENTS env vars, and I found that the function works best with at least 1024MB of memory assigned.

Attach the generated bootstrap.zip file and save the function.

Finally, we need to configure the S3 events: pick “S3” events from the “Add triggers” section on the page and pick the option that says “All upload events”. I’ve also set the prefix/suffix option to prevent our recursion problem.

Save the function again and upload an image to test. If everything went well, it should generate resized images after the upload is complete. You can verify it works (or catch any errors) in AWS CloudWatch, where the log should look something like this:

START RequestId: 7e7886d6-f983-4ef7-9916-83ab53874c6c Version: $LATEST
2019-03-09 15:07:35 INFO [lambda_runtime::runtime] Received new event with AWS request id: 7e7886d6-f983-4ef7-9916-83ab53874c6c
2019-03-09 15:07:35 INFO [bootstrap] Fetching: original-rust/blog/image-resize-rust/lambda-config.jpg, config: Config { sizes: [360.0, 720.0, 1200.0, 2200.0], replacements: [("original-rust", "r"), ("original", "r")] }
2019-03-09 15:07:36 INFO [bootstrap] Uploaded: r/blog/image-resize-rust/lambda-config-360.jpg with: 200
2019-03-09 15:07:36 INFO [bootstrap] Uploaded: r/blog/image-resize-rust/lambda-config-1200.jpg with: 200
2019-03-09 15:07:36 INFO [bootstrap] Uploaded: r/blog/image-resize-rust/lambda-config-720.jpg with: 200
2019-03-09 15:07:38 INFO [bootstrap] Uploaded: r/blog/image-resize-rust/lambda-config-2200.jpg with: 200
2019-03-09 15:07:38 INFO [lambda_runtime::runtime] Response for 7e7886d6-f983-4ef7-9916-83ab53874c6c accepted by Runtime API
END RequestId: 7e7886d6-f983-4ef7-9916-83ab53874c6c
REPORT RequestId: 7e7886d6-f983-4ef7-9916-83ab53874c6c  Init Duration: 86.53 ms Duration: 2944.40 ms    Billed Duration: 3100 ms Memory Size: 1024 MB   Max Memory Used: 125 MB

Future goals

You can find the source on GitHub and a ready-to-go bootstrap.zip on the release page.

The binary works great and has resized many images already. With Amazon's generous free Lambda tier, resizing all the images on my blog has cost me a grand total of $0.61. There is room for improvement, however. Error handling can be a lot nicer than .expect() everywhere, though as long as it logs the error in CloudWatch it works for me right now.

It would be nice if it could handle more image formats. While the image crate works fine with input formats such as GIF, JPEG, PNG and WEBP, right now I only generate JPEG images. I'd like it to generate WEBP images alongside the JPEGs, but I couldn’t find any crate that can generate WEBP images. If you happen to know one or have other feedback on this post, please let me know by email or tweet me.


Don't be mean: Statistical means and percentiles 101

Robert Beekman — Tue, 04 Dec 2018 13:32:40 +0000

Performance monitoring is an important part of running a successful application. One of the most basic ways to tell the performance of something is to measure the duration each time it happens and distill statistics from it.

Mean

The mean or average of a collection of values is a good start to see how well or badly something behaves. It is calculated by summing all the values under consideration and then dividing by the number of occurrences.

In Ruby, this is what calculating the mean response time would look like:

def mean(array)
  (array.sum.to_f / array.length).round(2)
end

durations = [1,2,3,4,5,6,7,8,9,0]
mean(durations) #=> 4.5

Note: In the example, for a more accurate result when dividing, we cast the total duration value to a Float. Otherwise, Ruby would round down to the nearest Integer, returning 4 instead.
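To make the note above concrete, here is a small sketch that runs both kinds of division on the same durations. The variable names are only illustrative, and Integer#fdiv is shown as an alternative to the to_f cast rather than something the original example uses.

```ruby
durations = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]

# Integer / Integer rounds down and drops the fraction
integer_mean = durations.sum / durations.length

# Casting the sum to a Float keeps the fraction
float_mean = durations.sum.to_f / durations.length

# Integer#fdiv performs float division without an explicit cast
fdiv_mean = durations.sum.fdiv(durations.length)

puts integer_mean # 4
puts float_mean   # 4.5
puts fdiv_mean    # 4.5
```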

Median

Another useful statistic is the median. While it sounds similar, there’s a difference between the mean and median of a collection of values.

The median is the value separating the upper half of a set from the lower half of the set.

For a dataset with an odd number of values, you get the median by first sorting the values, then selecting the middle number. For a set with an even number of values, after sorting them, the median will be the mean of the two middle numbers.

def median(array)
  sorted_array = array.sort
  length = sorted_array.length

  if length.odd? # Middle number for odd arrays
    sorted_array[length / 2]
  else # Mean of two middle numbers
    first_value = sorted_array[length / 2]
    second_value = sorted_array[length / 2 - 1]
    (first_value + second_value) / 2.to_f
  end
end

# Even array
durations = [1,2,3,4,5,6,7,8,9,0]
median(durations) #=> 4.5

# Odd array
durations = [1,1,2,3,4,5,6,7,8,9,0]
median(durations) #=> 4

This statistic is a good way of seeing if there is a huge skew in data or a long tail.

durations = [1,2,3,4,5,2000]

median(durations) #=> 3.5
mean(durations) #=> 335.83

The mean for the durations above would be 335.83 because of the single outlier of 2000ms. The median, which is only 3.5, indicates that there is a skew.

By calculating both the mean and median of a dataset, you can figure out if there are any large outliers or a long tail.
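One possible way to turn that comparison into a single number is to divide the mean by the median. The sketch below does this with a hypothetical helper, mean_to_median_ratio, which is not part of any library; a ratio close to 1 suggests an even spread, while a large ratio hints at a long tail.

```ruby
def mean(array)
  array.sum.to_f / array.length
end

def median(array)
  sorted = array.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Hypothetical helper: how many times larger the mean is than the median
def mean_to_median_ratio(array)
  mean(array) / median(array)
end

even_spread = [1, 2, 3, 4, 5, 6]
long_tail   = [1, 2, 3, 4, 5, 2000]

puts mean_to_median_ratio(even_spread).round(2) # 1.0
puts mean_to_median_ratio(long_tail).round(2)   # 95.95
```

What counts as a "large" ratio depends on your data, so treat any cut-off as an assumption to tune.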

The Problem with Mean

While mean and median are good indicators of performance, they don’t tell the whole story. If you request a webpage ten times, the mean could be very low, but one or more requests can still take a very long time to complete.

The image below shows the 99th (blue) and 90th (green) percentiles and the mean (red) for a certain action in AppSignal. You can see that the 99th and 90th are quite far from the mean and there are some spikes. This means that while your average customer has a good experience, every once in a while there's a user who has to wait almost twice as long for the page to render. Ideally, you would want to get all these values as close to each other as possible, creating a more consistent experience for all your users.

For example, take the following duration set, in which 10 customers request a page with durations between 100 milliseconds and 1 second.

[100,100,100,100,100,100,100,100,100,1_000]

This would result in a mean of just 190ms, while one user had a very bad experience with a 1 second response time. When only tracking the mean, it's easy to think your website has great performance, while in reality every once in a while a user has a terrible experience.

The example above covers only 10 requests, but imagine you had a thousand requests per day: at the same rate, a hundred of those users would have a terrible experience.
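The arithmetic behind this example can be sketched in a few lines. The 500ms cut-off for a "slow" request is an assumption made for illustration; it does not come from AppSignal.

```ruby
durations = [100, 100, 100, 100, 100, 100, 100, 100, 100, 1_000]

# Assumed threshold (in ms) above which we call a request slow
slow_threshold_ms = 500

mean = durations.sum.to_f / durations.length
slow_share = durations.count { |d| d > slow_threshold_ms }.to_f / durations.length

# Scale the same share up to a day with 1,000 requests
slow_per_thousand = (slow_share * 1_000).round

puts mean              # 190.0
puts slow_share        # 0.1
puts slow_per_thousand # 100
```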

Percentiles

To give a better idea of the distribution of the values, we use percentiles. The median is a number that signifies the point in the dataset where half of the set is below the number and half of it is above. Percentiles generalize this idea: the 20th percentile is the number that 20% of the values in the dataset are below.

Given the following (sorted) set:

[100,100,200,200,300,300,400,400,500,5_000]

If we wanted to know the 20th percentile, we can calculate it in the following way: There are 10 values in the set. The wanted value is at position 1 (20.0 / 100 * 10 - 1) as our arrays start at zero. Since this array contains an even amount of items, we have to calculate the mean between the values at the index (1) and index + 1 (2). Those values are 100 and 200, which results in a value of 150 for the 20th percentile.
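Written out step by step, that calculation looks roughly like the sketch below. It only covers this one case (a whole-number index in an even-sized set), and the variable names are illustrative.

```ruby
sorted = [100, 100, 200, 200, 300, 300, 400, 400, 500, 5_000]
wanted_percentile = 20

# Zero-based position of the wanted value: 20 / 100 * 10 - 1 = 1
index = (wanted_percentile / 100.0 * sorted.length - 1).to_i

# Mean of the value at the index and its right-hand neighbour
lower = sorted[index]
upper = sorted[index + 1]
p20 = (lower + upper) / 2.0

puts index # 1
puts lower # 100
puts upper # 200
puts p20   # 150.0
```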

A very naive Ruby implementation would look like this:

def percentile(array, wanted_percentile)
  sorted_array = array.sort

  index = (wanted_percentile.to_f / 100) * sorted_array.length - 1

  # Check if index is not a round number
  if index != index.to_i
    sorted_array.at(index.ceil)
  elsif sorted_array.length.even?
    first_value = sorted_array.at(index)
    second_value = sorted_array.at(index + 1)
    (first_value + second_value) / 2
  else
    sorted_array.at(index)
  end
end

# An array with an odd amount of numbers
durations = [100,200,200,300,300,400,400,500,5_000]

percentile(durations, 20) #=> 200, index is a fraction, 0.8 the rounded index is 1
percentile(durations, 90) #=> 5000, index is a fraction, 7.1 the rounded index is 8
percentile(durations, 95) #=> 5000, index is a fraction, 7.55 the rounded index is 8

# An array with an even amount of numbers
durations = [100,100,200,200,300,300,400,400,500,5_000]

percentile(durations, 20) #=> 150, average of index 1 & 2 `(100 + 200) / 2`
percentile(durations, 90) #=> 2750, average of index 8 & 9 `(500 + 5000) / 2`
percentile(durations, 95) #=> 5000, index is a fraction, 8.5 the rounded index is 9

This percentile function looks very similar to our median calculation and in fact, the median is the same as the 50th percentile.

durations = [1,2,3]

percentile(durations, 50) == median(durations) #=> true

AppSignal uses the statistics above to generate performance metrics for your application. We do not just rely on the mean/average but calculate the 90th and 95th percentiles to show outliers that give a better idea of the distribution of your requests. Find out more on our performance tour page.

Oddities

Because of the way percentiles and averages are calculated, it’s sometimes possible for the 90th percentile to dip below the mean. For example, given the following dataset:

durations = [1,1,1,1,1,1,1,1,1,1,2000]

percentile(durations, 90) #=> 1
mean(durations) #=> 182.73

This would give us a mean of 182.73, and a 90th percentile of just 1.

If your metric collection system only shows the 90th percentile and the mean, you’d still be able to deduce that there’s a huge outlier somewhere in your dataset if the 90th percentile drops below the average.
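That deduction can be sketched as a simple check. The example below repeats the mean and naive percentile functions from this post so that it runs on its own, and adds a hypothetical outlier_suspected? helper; the helper is an illustration, not how AppSignal detects outliers.

```ruby
def mean(array)
  (array.sum.to_f / array.length).round(2)
end

# The naive percentile from this post, with explicit integer indexes
def percentile(array, wanted_percentile)
  sorted_array = array.sort
  index = (wanted_percentile.to_f / 100) * sorted_array.length - 1

  if index != index.to_i
    sorted_array.at(index.ceil)
  elsif sorted_array.length.even?
    (sorted_array.at(index.to_i) + sorted_array.at(index.to_i + 1)) / 2
  else
    sorted_array.at(index.to_i)
  end
end

# Hypothetical check: a 90th percentile below the mean hints at a big outlier
def outlier_suspected?(array)
  percentile(array, 90) < mean(array)
end

puts outlier_suspected?([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2000]) # true
puts outlier_suspected?([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])  # false
```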

You are almost at 100% of this post

That's it for now! In another post, we're going to talk about how we efficiently store and calculate percentiles for all our customers' requests using Quantiles. If you have any questions or remarks about statistics and APMs, error tracking or performance monitoring, hit us up on Twitter @AppSignal or via email.

The innards of a RubyGem

Robert Beekman — Tue, 23 Oct 2018 13:26:21 +0000

Gather ’round children, and let grandpa recount the ways of the old days when life was hard, and installing gems was a headache-inducing, hair-pulling, teeth-gritting ordeal.

Back when I was just starting in Ruby, there was no Bundler and gems had to be installed the hard way. In Rails, this meant running rake gems:install a million times, fixing occurring bugs along the way, until the command passed with no errors. Today, we’re going to create a gem the old school way, after looking into what gems are and how they work.

Gems, What Are They?

RubyGems are an easy way to extend your own code with functionality written by other people. For example, instead of writing your own authentication/authorization code, you can use Devise, or if you want to re-size uploaded images you can use CarrierWave. This allows you to write reusable code that you can share with other people.

But How Do They Work?

In its most basic form, a gem is nothing more than a zipped-up directory containing code and a <name>.gemspec file. This .gemspec file contains metadata about the gem such as its name, what files to load and its dependencies.

The gem install or bundle command downloads the zip file from the source and extracts it to your hard drive. You can find out where a gem is located by running bundle info <gem name> or by directly opening the gem directory by running bundle open <gem name>.

To load the gem into your application, RubyGems monkey-patches the require method in the Kernel module. It first tries to read the file from disk and if that doesn’t work, it then tries to resolve the file in each of the gems on your system. Once it finds the file in a gem it “activates” the gem by adding it to the load path.

If you use Bundler, it adds each specific gem to the load path during the setup call. This saves Rubygems the hassle of trying to resolve the paths. It also prevents Ruby from loading a different version of the gem than is selected in the Gemfile(.lock).
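You can poke at the pieces described in this section from a plain Ruby process. The sketch below assumes RubyGems is enabled (the default) and only inspects state; the exact output differs per Ruby version and installation, so none is shown.

```ruby
require "rubygems" # loaded by default, listed here for clarity

# Where the `require` method is defined. With RubyGems active this typically
# points into rubygems/core_ext/kernel_require.rb; it is nil for a method
# implemented in C.
require_location = Kernel.instance_method(:require).source_location

# Gems that have been activated in this process so far
activated_gems = Gem.loaded_specs.keys

# Number of directories Ruby searches when resolving a `require`
load_path_entries = $LOAD_PATH.size

puts require_location.inspect
puts activated_gems.inspect
puts load_path_entries
```

Requiring a file from a gem that has not been activated yet and running the last lines again should show the gem in the list and extra entries on the load path.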

How Can I Make One?

The easiest way to create your own gem is to use Bundler to generate a gem scaffold. This includes a proper directory structure, license, code of conduct and a test environment for the gem.

However, today we’re going to create our own minimalistic gem with just two files, one containing the code and a gemspec file that contains the metadata. Our gem will greet the user when called. Let’s start by creating a directory for our gem.

mkdir howdy
cd howdy

In this directory, we’ll create a lib folder that will contain the code and a howdy.gemspec file that will contain the metadata. It should look something like this:

tree
.
├── howdy.gemspec
└── lib
    └── howdy.rb

Our howdy gem has the following code:

lib/howdy.rb

class Howdy
  def greet
    "howdy!"
  end
end

A minimalistic howdy.gemspec file contains information about the version, author, etc. It also specifies the files to keep when building a gem. This prevents the users of the gem from having to download unnecessary files such as tests and other files that aren't needed to run the gem code.

Gem::Specification.new do |spec|
  spec.name          = "howdy"
  spec.version       = "0.0.1"
  spec.authors       = ["Robert Beekman"]
  spec.email         = ["robert@example.com"]

  spec.summary       = %(Greets the user)
  spec.description   = %(Howdy is a gem that greets the user when called)
  spec.license       = "MIT"

  spec.files         = ["lib/howdy.rb"]
end

To build the gem we can use the gem build howdy.gemspec command. It generates a howdy-0.0.1.gem file containing your code. To make the gem available to other people, you can publish it to rubygems.org with the gem push command.
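The name of the generated file follows from the metadata in the gemspec. As a small sketch, you can build the same specification in plain Ruby and ask it for the names RubyGems derives from it (a trimmed-down copy of the gemspec above is used here):

```ruby
require "rubygems"

spec = Gem::Specification.new do |s|
  s.name    = "howdy"
  s.version = "0.0.1"
  s.authors = ["Robert Beekman"]
  s.summary = "Greets the user"
  s.files   = ["lib/howdy.rb"]
end

puts spec.full_name     # howdy-0.0.1
puts spec.file_name     # howdy-0.0.1.gem
puts spec.files.inspect # ["lib/howdy.rb"]
```

Bumping spec.version therefore also changes the name of the .gem file that gem build produces.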

Recap

These are the steps needed to create and publish a very basic gem. We hope you enjoyed us diving into the archeology of gems, and the old school way of making them. As mentioned before this was for educational purposes; we recommend using Bundler to generate a gem scaffold in today's world.

Peace out, youngsters! If you have any ideas, questions or comments, please don't hesitate to leave a comment.

Custom Exceptions in Ruby

Robert Beekman — Tue, 03 Jul 2018 12:44:38 +0000

A little while ago we talked about exceptions in Ruby. This time we explore ways of creating custom exceptions specific to your app’s needs.

Let's say we have a method that handles the uploading of images while only allowing JPEG images that are between 100 Kilobytes and 10 Megabytes. To enforce these rules we raise an exception every time an image violates them.

class ImageHandler
  def self.handle_upload(image)
    raise "Image is too big" if image.size > 10.megabytes
    raise "Image is too small" if image.size < 100.kilobytes
    raise "Image is not a JPEG" unless %w[JPG JPEG].include?(image.extension)

    #… do stuff
  end
end

Every time a user uploads an image that doesn't meet the rules, our (Rails) web app displays the default Rails 500 error page for the uncaught error.

class ImageUploadController < ApplicationController
  def upload
    @image = params[:image]
    ImageHandler.handle_upload(@image)

    redirect_to :index, :notice => "Image upload success!"
  end
end

The Rails generic error page doesn't offer the user much help, so let's see if we can improve on these errors. We have two goals: inform the user when the file size is outside the set bounds and prevent hackers from uploading potentially malicious (non-JPEG) files, by returning a 403 forbidden status code.

Custom error types

Almost everything in Ruby is an object, and errors are no exception. This means that we can subclass from any error class and create our own. We can use these custom error types in our handle_upload method for different validations.

class ImageHandler
  # Domain specific errors
  class ImageExtensionError < StandardError; end
  class ImageTooBigError < StandardError
    def message
      "Image is too big"
    end
  end
  class ImageTooSmallError < StandardError
    def message
      "Image is too small"
    end
  end

  def self.handle_upload(image)
    raise ImageTooBigError if image.size > 10.megabytes
    raise ImageTooSmallError if image.size < 100.kilobytes
    raise ImageExtensionError unless %w[JPG JPEG].include?(image.extension)

    #… do stuff
  end
end

First, we've added three new classes to the handler that extend from StandardError. For the image size errors, we've overridden the message method of StandardError with an error message we can show to users. The way raise is called in the handle_upload method has also changed: instead of passing a message string to raise, we now raise a different, more specific, error type.
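Overriding message is one way to attach a fixed text to an error. A common alternative, sketched below as a variation and not as the approach used in this post, is to give initialize a default message. Callers can then still pass a more specific message to raise.

```ruby
class ImageTooBigError < StandardError
  # Default message that callers can still override
  def initialize(msg = "Image is too big")
    super
  end
end

default_message = begin
  raise ImageTooBigError
rescue ImageTooBigError => e
  e.message
end

custom_message = begin
  raise ImageTooBigError, "Image is larger than 10 MB"
rescue ImageTooBigError => e
  e.message
end

puts default_message # Image is too big
puts custom_message  # Image is larger than 10 MB
```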

Now, we can use these custom error types in our controller to return different responses to errors. For instance, we can return the specific error message or a specific response code.

class ImageUploadController < ApplicationController
  def upload
    @image = params[:image]
    ImageHandler.handle_upload(@image)

    redirect_to :index, :notice => "Image upload success!"

  rescue ImageHandler::ImageTooBigError, ImageHandler::ImageTooSmallError => e
    # render does not accept :alert, so set the flash for this request instead
    flash.now[:alert] = "Error: #{e.message}"
    render "edit"

  rescue ImageHandler::ImageExtensionError
    head :forbidden
  end
end

This is already a lot better than using the standard raise calls. With a little bit more subclassing we can make it easier to use, by rescuing entire error groups rather than every error type separately.

class ImageHandler
  class ImageExtensionError < StandardError; end
  class ImageDimensionError < StandardError; end
  class ImageTooBigError < ImageDimensionError
    def message
      "Image is too big"
    end
  end
  class ImageTooSmallError < ImageDimensionError
    def message
      "Image is too small"
    end
  end

  def self.handle_upload(image)
    raise ImageTooBigError if image.size > 10.megabytes
    raise ImageTooSmallError if image.size < 100.kilobytes
    raise ImageExtensionError unless %w[JPG JPEG].include?(image.extension)

    #… do stuff
  end
end

Instead of rescuing every separate image dimension exception, we can now rescue the parent class ImageDimensionError. This will rescue both our ImageTooBigError and ImageTooSmallError.

class ImageUploadController < ApplicationController
  def upload
    @image = params[:image]
    ImageHandler.handle_upload(@image)

    redirect_to :index, :notice => "Image upload success!"

  rescue ImageHandler::ImageDimensionError => e
    # render does not accept :alert, so set the flash for this request instead
    flash.now[:alert] = "Error: #{e.message}"
    render "edit"

  rescue ImageHandler::ImageExtensionError
    head :forbidden
  end
end
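The same grouping can be tried outside Rails. The sketch below reuses the error hierarchy from the handler and adds a hypothetical response_for method that stands in for the controller, returning a symbol instead of rendering a response.

```ruby
class ImageHandler
  class ImageExtensionError < StandardError; end
  class ImageDimensionError < StandardError; end
  class ImageTooBigError < ImageDimensionError; end
  class ImageTooSmallError < ImageDimensionError; end
end

# Hypothetical stand-in for the controller action: raises the given error
# class and reports which rescue clause handled it
def response_for(error_class)
  raise error_class
rescue ImageHandler::ImageDimensionError
  :show_alert
rescue ImageHandler::ImageExtensionError
  :forbidden
end

puts response_for(ImageHandler::ImageTooBigError)    # show_alert
puts response_for(ImageHandler::ImageTooSmallError)  # show_alert
puts response_for(ImageHandler::ImageExtensionError) # forbidden
```

Both size errors end up in the first rescue clause because rescue matches on the class hierarchy, so any future subclass of ImageDimensionError would be handled there as well.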

The most common case for using your own error classes is when you write a gem. The mongo-ruby-driver gem is a good example of the use of custom errors. Each operation that could result in an exception has its own exception class, making it easier to handle specific use cases and generate clear exception messages and classes.

Another advantage of custom exception classes shows up when you use exception monitoring tools like AppSignal. These tools give you a better idea of where exceptions occurred, and group similar errors in the user interface.

If you liked this article, check out more of what we wrote on AppSignal Academy. AppSignal is all about building better apps. In our Academy series, we'll explore application stability and performance, and explain core programming concepts.

We'd love to know what you thought of this article, or if you have any questions. We're always on the lookout for topics to investigate and explain, so if there's anything magical in Ruby you'd like to read about, don't hesitate to leave a comment.