Ned McClain - https://www.nedmcclain.com/

Community Formbricks Helm Chart

https://www.nedmcclain.com/community-formbricks-helm-chart/ · Tue, 06 Feb 2024

I have two clients using Formbricks for everything from simple contact forms to full-blown "in-app experience management". One of them needed a Helm chart to deploy to their Kubernetes environment, and they were happy to let me share:

GitHub - nmcclain/formbricks-helm: Helm chart for deploying Formbricks on Kubernetes
https://github.com/nmcclain/formbricks-helm

Formbricks rocks. I love it for the Open Source license and the fact the only dependency is Postgres.

Users love it for the rich widgets, beautiful mobile-first UI, and large library of best-practices surveys.

Product managers love it for the no-code triggers, powerful segmentation options, and slick reporting.

I think you'll love it, too.

💡 Hey, we all want to see Formbricks succeed and prove out the COSS business model. Please use this chart for anything but hosting a competing product.

Quickstart

Configuration

At a minimum, you have to set these two options in a Helm values file:

formbricks:
  webapp_url: http://localhost:3000
  nextauth_url: http://localhost:3000

All options are defined in the values.yaml file. Formbricks configuration documentation can be found here.

Create formbricks secret

  • Create formbricks namespace: kubectl create ns formbricks
  • Execute the following command after updating the Postgres connection string:
kubectl create secret -n formbricks generic formbricks-secrets \
	--from-literal=database_url='postgresql://formbricks:CHANGE_ME@postgres:5432/formbricks?schema=public' \
	--from-literal=nextauth_secret="`openssl rand -hex 32`" \
	--from-literal=encryption_key="`openssl rand -hex 32`"

Install Chart

helm upgrade --install -n formbricks formbricks oci://ghcr.io/nmcclain/formbricks/formbricks --version 0.1.6
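
To confirm the release is healthy, check the pods and port-forward for a quick smoke test. Note that svc/formbricks is an assumption based on the chart's default naming - adjust it if you override the fullname:

# Watch the rollout, then forward the web UI to localhost:3000 (matching webapp_url above)
kubectl -n formbricks get pods
kubectl -n formbricks port-forward svc/formbricks 3000:3000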

Next steps

More documentation can be found in the README.

Running into issues? Missing features? Please open a GitHub issue!

If you're facing challenging Kubernetes or ops problems, I'm focused where software, infrastructure, and culture intersect. Get in touch here!

Enhance Your Grafana ❤️ ArgoCD Integration

https://www.nedmcclain.com/enhance-your-grafana-argocd-integration/ · Fri, 02 Feb 2024

ArgoCD and Grafana have been key building blocks in a big chunk of the Kubernetes environments I've worked with. ArgoCD provides continuous app delivery, while Grafana acts as the "single pane of glass" for ops/dev/product team observability.

Argo CD follows the GitOps pattern of using Git repositories as the source of truth for defining the desired application state.

Most teams have some metrics-based ArgoCD dashboards in Grafana.

This post aims to inspire you to maximize the value of these tools by exposing ArgoCD API data in Grafana.

Give non-technical users an answer to the perennial question of "what's running where" with Grafana and ArgoCD.

How to Connect Grafana with the ArgoCD API

ArgoCD has built-in support for Prometheus-compatible metrics. The vendor-provided dashboards are beautiful, providing visibility into your Applications/ApplicationSets and other ArgoCD objects.

Unfortunately, the metric labels are not quite sufficient to fully answer the question "what version of each service is running where?" For that, we need to display the actual git hashes and image versions running on each pod.

This is easy to do by connecting a JSON datasource for Grafana to the ArgoCD REST API. Here's what it looks like in your Grafana dashboard:

[Screenshot: ArgoCD application data rendered in a Grafana table panel]

Implement it yourself in these four steps:

1. ArgoCD role and JWT

To allow Grafana to access the ArgoCD API, you must create a Project Role and issue it a JWT.

It's always a good idea to avoid the default project in ArgoCD. You should create deliberate projects that are security-scoped to only the necessary source repos, destination clusters, and cluster resources.

Inside the desired project, click the Add Role button. There, you can populate a new role with read-only permissions. Use your project name in place of blog-project below:

[Screenshot: ArgoCD's Add Role form, with read-only policies scoped to blog-project]
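
If you'd rather manage this as configuration than clicks, a read-only grant for the project's applications boils down to a single RBAC policy line of this shape (a sketch - the read-only role name is a placeholder):

p, proj:blog-project:read-only, applications, get, blog-project/*, allow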

Now you can issue a JWT for this role, which will be used by Grafana. Click your new role, scroll to the bottom, then create a token. Use an expiry if appropriate for your organization's policies.


Scroll down to see the new token. Save this for Step 3 - you won't be able to see it again.


2. Install Infinity data source plugin

The Infinity plugin for Grafana can read data from a variety of sources, including ArgoCD's JSON API responses. There are three ways to install it:

  • Using the CLI, run: grafana-cli plugins install yesoreyeram-infinity-datasource
  • In the Helm chart, add - yesoreyeram-infinity-datasource to grafana.plugins.
  • From the web UI:
    • Click on Administration -> plugins.
    • Switch state from Installed to All, then search for "infinity".
    • Click on "Infinity, by Sriramajeyam Sugumaran".
    • Click the "Install" button in the top-right corner.

3. Configure Infinity data source

  • In the web UI, navigate to Connections -> Add new connection and click on "Infinity".
  • Click the Add new data source button.
  • Click the Authentication tab, then pick Bearer Token. We'll use the JWT from our ArgoCD as a Bearer token in the Authentication header - paste it here.
  • Add your ArgoCD Server URL under Allowed hosts. Since this example is deployed to Kubernetes, and both Grafana and ArgoCD are in the same namespace, I simply used https://argocd-server.
  • Click the blue Save & Test button.
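
Before building the panel, it's worth sanity-checking the role and token outside Grafana. This sketch assumes the JWT is exported as ARGOCD_JWT and that you run it from inside the cluster (or through a port-forward); add -k if ArgoCD is still using its self-signed certificate:

curl -s -H "Authorization: Bearer $ARGOCD_JWT" \
  https://argocd-server/api/v1/applications | jq '.items[].metadata.name'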

4. Panel setup

  • Finally, go to your dashboard (or create a new one) and add a new panel.
  • Pick the Table visualization type (although you can experiment with others like State and Gauge).
  • Select the Infinity datasource you configured in step 3, set: Type=JSON, Parser=Backend, Source=URL, Format=Table, Method=GET, and adjust the URL to https://YOUR-argocd-server/api/v1/applications:
  • If the connection to ArgoCD is working, you should see some messy JSON in your table. To parse it, click Parsing options & Result fields and add this JSON selector to the Rows/Root field: $[].items
  • Now you can Add Columns with the JSON selector and column titles you want. Here are some examples:
metadata.name
metadata.namespace
status.health.status
status.history.0.revision [git hash]
status.operationState.finishedAt
status.operationState.message
status.operationState.syncResult.source.path
status.operationState.syncResult.source.repoURL
status.operationState.syncResult.source.targetRevision [git branch or tag]
status.summary.images
status.sync.status
  • Click Apply, save your dashboard, and enjoy!

Other Quick Wins

For a more visually pleasing display, consider the Dynamic Text Panel maintained by the fine folks at Volkov Labs. This panel will lay out the ArgoCD JSON data using HTML or Markdown.


If you're not collecting ArgoCD metrics yet, do it! Using the ArgoCD helm chart, you need to set this value to enable the metrics exporter: controller.metrics.enabled: true.

Then, configure your Prometheus or VictoriaMetrics instance to scrape the ArgoCD endpoints. When using the ArgoCD Helm chart, you can either enable ServiceMonitor resources (preferred) or Prometheus scrape annotations, and these services will be discovered automatically.
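
For example, if you run the Prometheus Operator, Helm values along these lines enable both the exporter and its ServiceMonitor (key names follow the argo-cd chart - double-check them against your chart version):

controller:
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true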

It's also worth enabling the API, Repo Server, and Dex metrics endpoints. They expose details about requests to the Redis backend, git repositories, and the authentication server. Set these Helm values to enable all metrics:

dex.metrics.enabled: true
redis.metrics.enabled: true # or see values file if using redis-ha
server.metrics.enabled: true
repoServer.metrics.enabled: true
applicationSet.metrics.enabled: true
notifications.metrics.enabled: true

Finally, Kubernetes labels on ArgoCD Applications can optionally be converted to Prometheus labels. This is super-useful for breaking down dashboards and alerts by app owner, project, etc. You need to specify the list of labels that are meaningful to your org. The main reason not to include certain labels is high cardinality - for example, a node identifier or git hash. Set these Helm values:

controller:
  metrics:
    enabled: true
    applicationLabels:
      enabled: true
      labels: ["business_unit", "team_name"]

Alerts

In my experience, alerts are 10x as valuable as dashboards. It'd be an easy choice if I could only pick one. This could be a blog post on its own, but here are a few quick tips:

[Screenshot via https://argocd-notifications.readthedocs.io/en/stable/services/grafana/]
  • Sources for example ArgoCD Prometheus alert rules: Awesome Prometheus Alerts and the ArgoCD Grafana Mixin rules by Adin Hodovic (a minimal rule is sketched below this list).
  • Consider alerting more on SLO-driven error budgets than on individual infrastructure components [see: Google SRE book, workbook].
  • Make a thoughtful organizational decision about where alert rules will live. Grafana-native alerts have the benefit of an amazing UI, but Prometheus alerts can be managed with IaC and have a better HA story.
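
As a concrete starting point, here is the kind of rule those sources provide, built on ArgoCD's argocd_app_info metric. Treat it as a sketch and tune the threshold, duration, and labels to your environment:

groups:
  - name: argocd
    rules:
      - alert: ArgoCdAppOutOfSync
        expr: argocd_app_info{sync_status!="Synced"} == 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "ArgoCD application {{ $labels.name }} has been {{ $labels.sync_status }} for 15 minutes"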

IaC

All of this Grafana and ArgoCD configuration can be captured as code, with one exception: issuing the JWT in ArgoCD. For that, you must use the UI or API.

For Helm users, both the Infinity data source and the plugin itself can be provisioned directly from Grafana's Helm chart values.
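
As a rough sketch, values like these provision both the plugin and an Infinity data source pointed at ArgoCD; the Infinity jsonData/secureJsonData keys below are assumptions to verify against the plugin's provisioning docs:

grafana:
  plugins:
    - yesoreyeram-infinity-datasource
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: ArgoCD
          type: yesoreyeram-infinity-datasource
          access: proxy
          jsonData:
            auth_method: bearerToken
            allowedHosts:
              - https://argocd-server
          secureJsonData:
            bearerToken: <YOUR ARGOCD JWT>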

If you're not using Helm, this Grafana blog post outlines ways to deploy dashboards and datasources "as code," including with Terraform, Ansible, Crossplane, and the Grafana Operator for Kubernetes.

Alternative Approach

ArgoCD's web dashboard is rich with data. Why do we care to display it in Grafana? Often, stakeholders will already have access to Grafana, but not ArgoCD or kubectl.

If that's not the case in your world, it might make sense to display some basic metrics directly in ArgoCD instead. Use the ArgoCD Metrics Extension to achieve this.

Closing Thoughts

When you say "Grafana," most people think "metrics". Or maybe "metrics, logs and traces". But the truth is there are dozens of amazing Grafana data sources that can be used to pull in data from across your tech ecosystem.

Does your app have an admin API? By all means use Grafana to build alerts to make that data actionable. Got some SalesForce in your world? Pull in critical data via JSON to display in your dashboards. Correlate infra-level metrics with business ones.

I hope this post inspires you to think outside the box when it comes to having all your data visible in one place.

If you're seeking a partner to assist your Platform, DevOps, or SRE team with tough problems, I'm focused where software, infrastructure, and culture intersect. Get in touch here!

Cover photo: Ames power station, Ophir, Colorado - influence from Nunn, Westinghouse, and Nikola Tesla. Via: Jay8g on Wikipedia.

PostgREST on Fargate

https://www.nedmcclain.com/postgrest-rest-api-on-aws-fargate/ · Thu, 21 Jan 2021

Launching an MVP is the only way to get real user feedback on your startup. Rather than build out a custom backend, PostgREST exposes your Postgres DB tables as a REST API. Put those first months of engineering focus into the frontend and unique business logic - save the custom backend for when you need to scale.

Using PostgREST is an alternative to manual CRUD programming. Custom API servers suffer problems. Writing business logic often duplicates, ignores or hobbles database structure. Object-relational mapping is a leaky abstraction leading to slow imperative code. The PostgREST philosophy establishes a single declarative source of truth: the data itself. - https://postgrest.org/

AWS Fargate is an ideal compute platform for PostgREST. Being serverless, it is very efficient in terms of cost and operations. Yet Fargate containers are relatively long-lived compared to AWS Lambda - this allows PostgREST's caching mechanisms to remain effective.

Prerequisites

  • You'll need an AWS account and its associated access/secret keys.
  • Install the AWS CLI [directions here], and configure your AWS keys [directions here].
  • The Fargate CLI is the easiest way to deploy a container on AWS. You'll want to install it [download here].
  • A Postgres database. Your Fargate task(s) will need network access to this database. If you are using RDS, be sure to specify subnets and security groups for your load balancer and tasks (see below) so they get deployed in the same VPC as your database.

Ship it

1. Setup a test DB, API, role, and table in Postgres:

Connect to Postgres, then create a database for this exercise:

postgres=> create database startup;
CREATE DATABASE
postgres=> \c startup;
psql ...
You are now connected to database "startup" as user "root".
startup=>

PostgREST uses a "naked schema" to identify which DB tables should be exposed in the API. We can create one, and add a sample table:

startup=> create schema api;
CREATE SCHEMA

startup=> create table api.trees (id serial primary key, name text not null, species text not null);
CREATE TABLE

startup=> insert into api.trees (name, species) values ('Banyan', 'Ficus benghalensis'), ('Quaking Aspen', 'Populus tremula'), ('American Elm', 'Ulmus americana'), ('Red Maple', 'Acer rubrum');
INSERT 0 4

Finally, we need to set up two roles. The first, web_anon, controls anonymous access with Postgres' standard grants and (optionally) row-level security. The second, authenticator, is the role PostgREST uses to connect to the database. Once connected, PostgREST assumes the web_anon role. Please pick a better password than 'secret1'.

create role web_anon nologin;

grant usage on schema api to web_anon;
grant select on api.trees to web_anon;

create role authenticator noinherit login password 'secret1';
grant web_anon to authenticator;

This structure ensures you are not using the root user for PostgREST connections and makes it easy to add JWT-authenticated roles in the future.
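
When you're ready for authenticated access, the same pattern extends naturally. This is only a sketch - the web_user role name and its grants are placeholders, not part of the tutorial above:

create role web_user nologin;
grant usage on schema api to web_user;
grant select, insert, update, delete on api.trees to web_user;
grant web_user to authenticator;

A JWT whose role claim is web_user then lets PostgREST switch to that role for the request.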

A more detailed PostgREST setup tutorial is available here: https://postgrest.org/en/v7.0.0/tutorials/tut0.html

2. Deploy an Application Load Balancer to route traffic to your Fargate task(s):

fargate lb create postgrest --port HTTP:80
 ℹ️  Created load balancer postgrest

Let's break down this command:

  • fargate lb create postgrest: create an ALB named postgrest.
  • --port HTTP:80: the load balancer will accept internet requests on port 80.
  • If you've deleted your default VPC, or wish to use a different VPC, be sure to specify the appropriate subnets (this flag can be used multiple times):
    --subnet-id subnet-0ae7XXX
  • You can also optionally specify multiple security groups with the --security-group-id sg-02ddXXX flag.

3. Deploy a PostgREST task:

This command will deploy the PostgREST image to a Fargate task. It automatically creates a new ECS cluster and a default IAM role for task execution.

fargate service create postgrest \
  --lb postgrest --port HTTP:3000 \
  --image 'registry.hub.docker.com/postgrest/postgrest' \
  --num 1 \
  --env PGRST_DB_URI='postgres://authenticator:secret1@HOST/startup' \
  --env PGRST_DB_SCHEMA='api' \
  --env PGRST_DB_ANON_ROLE='web_anon'
  ℹ️ Created service postgrest

A quick review of the flags we used:

  • fargate service create postgrest: create a postgrest Fargate service in a new ECS cluster.
  • --lb postgrest --port HTTP:3000: connect this service to your load balancer.
  • --image 'registry.hub.docker.com/postgrest/postgrest': deploy the postgrest image from dockerhub.
  • --env PGRST_DB_URI='postgres://authenticator:secret1@HOST/startup': the connection URI for your database. Note this is the username/password/database you set up in step 1.
  • --env PGRST_DB_SCHEMA='api': the name of the API schema you created in step 1.
  • --env PGRST_DB_ANON_ROLE='web_anon': the name of the anonymous role you created in step 1.

4. Test

If things go as planned, you'll be able to find your load balancer hostname with:

$ fargate lb list | grep postgrest
postgrest	Application	Active	postgrest-132447124.us-east-1.elb.amazonaws.com	HTTP:80

Now, you can query your new API:

curl postgrest-132447124.us-east-1.elb.amazonaws.com/trees
[{"id":1,"name":"Banyan","species":"Ficus benghalensis"},
 {"id":2,"name":"Quaking Aspen","species":"Populus tremula"},
 {"id":3,"name":"American Elm","species":"Ulmus americana"},
 {"id":4,"name":"Red Maple","species":"Acer rubrum"}]

Cost

AWS's "free tier" doesn't include Fargate, so here's the bottom line: it will cost you $2.83 to run PostgREST on Fargate for one month, while easily serving 500 requests/second. This is with the smallest single instance available - there is tons of room to scale both vertically and horizontally with this architecture.

Caution: the load balancer costs are more significant (~$16+/mo), but they are included in the "free tier" and can be shared between many Fargate (and Lambda/EC2) services.

Next Steps

  • Looking for more than a read-only API? You can enable authenticated access to read and write private information in your database. PostgREST supports JWT authentication, which maps to Postgres roles you configure. You can even enable row-level security. Tutorial is here, with further details here.
  • Please enable TLS (SSL) on your load balancer, and require encrypted connections to PostgREST. We don't want your credentials leaking!
  • In production, you might use an infrastructure-as-code tool like Terraform or CDK to deploy your Fargate tasks. These tools offer more control and customization than the Fargate CLI.
  • Want more content like this? Follow me on Twitter at @nedmcclain. Struggling to get PostgREST deployed on Fargate? DM me!

Thanks Nalin!

Hat tip to my dear friend @nalin, for encouraging me to write this post!


Cover photo by Kevin Bree on Unsplash.

Deploy your own GeoIP API in 6 minutes: video

https://www.nedmcclain.com/geoip-api-in-6-minutes/ · Mon, 09 Nov 2020

Geo IP lookups are used all over the place, from analytics, to marketing, to localization, and even fraud prevention.

There are dozens of free and commercial services in the space, and they might be a good fit for you. But many teams will decide they need to host their own Geolocation API to meet security, privacy, performance, or availability guarantees.

I recently helped a client deploy their own Geo IP lookup service, and wanted to show you how to do it too. You can get this setup in a matter of minutes, with just one open source library and using AWS Lambda for serverless hosting. The Serverless Framework makes deployment a breeze.
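
To give a sense of how little configuration is involved, a Serverless Framework service for a Lambda-backed lookup endpoint fits in a handful of lines. This is only a sketch - the service name, runtime, handler, and route are placeholders, and the video walks through the real setup:

service: geoip-api
provider:
  name: aws
  runtime: nodejs14.x
functions:
  lookup:
    handler: handler.lookup
    events:
      - httpApi: 'GET /geoip/{ip}'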

Want to do it yourself?

Check out the video below and you might take my next paid gig!

Please join me at @nedmcclain for more posts and videos on serverless, infrastructure, and operations!

Photo by Andrew Coelho on Unsplash

Static Hosting a Ghost Site

https://www.nedmcclain.com/static-ghost-site/ · Mon, 19 Oct 2020

Do you truly control your content if you post it on Medium?


At the beginning of this year, I switched from Medium to a static site powered by Ghost. Although Ghost(Pro) is a strong choice if you have paying subscribers, it is impossible to beat static hosting for a simple blog. Static sites are fast, secure, and free.

Here's a quick rundown of my reasoning and approach - from editing in Ghost's world-class UI to deploying for free on a static hosting provider like Netlify.

Requirements

I was looking for a blogging solution that met these requirements:

  • Slick UI: Even though I use vim and markdown daily, I prefer a Medium-like freeform writing experience.
  • Self-hosted: Full control of my content with no vendor lock-in, paywalls, or subscription pop-ups.
  • High availability: I should never receive a monitoring alert for my personal blog.
  • Exceptional performance: I want the page to load fast!
  • Secure and low-maintenance: After supporting dozens of WordPress sites over more than a decade, I have grown weary of hacked plugins and frequent patching cycles.

Static Ghost.io 👻

Even though I use markdown every day in my work, I prefer the WYSIWYG writing experience of a web UI. Ghost's UI is fantastic!


The main drawback of Ghost is the same as WordPress: every request depends on a database. Each time a user visits a page on the site, the following steps take place:

  1. The web server passes the end user's request to a running application server (Node.js for Ghost, PHP for WordPress).
  2. The application server analyzes the incoming request and dispatches requests to the database for necessary content (MySQL for Ghost).
  3. The application server renders the content from the database into a web page.
  4. The web server passes the resulting page back to the end user.

This architecture requires significant compute and memory resources on both the application server and database server. It is susceptible to performance issues due to spikes in demand. And achieving high availability requires a database cluster and web load balancer. Lots of distinct parts to deploy, secure, and operate.

The advent of the JAMstack architecture has brought static sites into the limelight. By moving steps 2 and 3 to the beginning, Ghost can generate each page on the site in advance. Then, it's easy to upload the static site to a free or almost-free hosting provider (see below for suggestions).

Fast and secure sites and apps delivered by pre-rendering files and serving them directly from a CDN, removing the requirement to manage or run web servers. - JAMstack

Static sites have a significantly smaller attack surface than sites driven by databases and application code. Bugs, vulnerabilities, and build issues can be caught during the build phase, rather than when the site is live.

You get the rich editing of Ghost while enjoying all of the operational and cost advantages of a static JAM site:

  • Never worry about scaling when traffic spikes.
  • Never apply a critical security patch at midnight.
  • Who cares if your Ghost database is temporarily down?

Making Ghost Static

First, set up a private Ghost instance to use for editing, formatting, and peer review. This can be a very small VM, since it doesn't have to serve any production traffic. Defying traditional best practices, it's even safe to co-locate MySQL on the same system! Ghost(Valet) is very reasonably priced if you need a hand with this step.

You can secure access to this system with a firewall or VPN. A better approach is to use an authentication proxy like oauth2-proxy or pomerium for a user-friendly, "zero trust" solution.

Then, use Ghost Static Site Generator (gssg) to create a static mirror of the site that is hosted by Netlify:

Fried-Chicken/ghost-static-site-generator: Generate a static site from Ghost and deploy using a CI
https://github.com/Fried-Chicken/ghost-static-site-generator

gssg will spider your local Ghost site, creating a fully static copy. Here's the short Makefile I use:

all:
	rm -rf static/
	gssg --url https://www.nedmcclain.com --domain http://127.0.0.1:2368
	cp ghost/etc/keybase.txt static/
	cp static/404/index.html static/404.html

You can publish this copy to your Netlify site with their CLI or (recommended) by linking a Git repository. Netlify offers 100GB of bandwidth/month for free, which is plenty for most use cases. They also take care of TLS certificates automatically.
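
For the CLI route, a one-off production deploy of the generated directory looks like this (assuming the static output lives in static/, as in the Makefile above):

npx netlify-cli deploy --dir=static --prod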

Netlify: All-in-one platform for automating modern web projects
https://www.netlify.com/

Of course, there are dozens of static hosting options similar to Netlify. Github Pages, AWS S3+CloudFront, AWS Amplify, Firebase Hosting, and Azure Static Web Apps are all safe choices with generous free tiers.

A Critical Limitation: Interactivity 🚨

By creating a static Ghost site, you naturally break all interactive features. Say goodbye to Ghost's library of integrations, and you won't be able to monetize with Ghost's membership and subscription management.

It's easy to include contact/feedback forms with tools like Netlify Forms or Upscribe. You can embed rich media, social media, and all kinds of Javascript widgets (weather, news, etc.). But if you intend to have users login to your site, you'll need a database-backed service.

For paid creators, Substack is growing in popularity with its focus on newsletters. Truly complex sites with many editors should evaluate Drupal or consider custom development. But I would encourage you to check out Ghost(Pro) - your subscription goes back into the Ghost foundation, making the software even better.

It Works!

This site is created with Ghost and hosted by the fine folks at Netlify. It's fast:

[Screenshot: page speed test results for this site]

The next time you need to deploy a fast, secure, simple blog, think about the juicy combination of Ghost's UI and free static hosting.

Let me know if you end up using this approach! Follow me at @nedmcclain for updates on web performance, infrastructure, and operations.

❤️

Note: I have no affiliation with or sponsorship from Ghost or Netlify. I pay for Netlify's "Pro" account, and it's worth every penny.

Ghost name and logo are trademarks of the Ghost Foundation.
Photo by Patrick Tomasso on Unsplash.

Introduction to ClickHouse Backups and clickhouse-backup

https://www.nedmcclain.com/introduction-to-clickhouse-backups-and-clickhouse-backup/ · Tue, 01 Sep 2020

I wrote this post for the Altinity Blog - the original can be found here: https://altinity.com/blog/introduction-to-clickhouse-backups-and-clickhouse-backup

The native replication support built into ClickHouse provides high availability and resilience against individual node failures. However, rare disaster scenarios may require recovering data from backups. These include data corruption and the failure of all replicas in a shard or cluster.

A critical component of any ClickHouse backup scheme is “freezing” tables. As with all databases, consistent backups depend on ClickHouse being in a “quiesced” state. Instead of having to halt the database entirely, ClickHouse has native support for “freezing” tables for backup or migration. This is a no-downtime operation.

Manual Backups in Four Easy Steps

ClickHouse includes native support for instantaneous point-in-time backups, through its ‘ALTER TABLE… FREEZE’ feature.

  1. Confirm your shadow directory is empty:
    ls /var/lib/clickhouse/shadow/
  2. Ask ClickHouse to freeze your table:
    echo -n 'alter table events freeze' | clickhouse-client
  3. Save your backup in case of disaster:
    cd /var/lib/clickhouse/
    sudo mkdir backup
    sudo cp -r shadow/ backup/my-backup-name
  4. Finally, clean up the backup source for next time:
    sudo rm -rf /var/lib/clickhouse/shadow/*

ClickHouse uses filesystem hard links to achieve instantaneous backups with no downtime (or locking) for ClickHouse services. These hard links can be further leveraged for efficient backup storage. On filesystems that support hard links, such as local filesystems or NFS, use cp with the -l flag (or rsync with the --hard-links and --numeric-ids flags) to avoid copying data.

When hard links are utilized, storage on disk is much more efficient. Because they rely on hard links, each backup is effectively a “full” backup, even though duplicate use of disk space is avoided.

Test Your Backup

It is rightly said that a backup is worthless if the restoration process hasn’t been tested. Perform regular test restores to ensure your data will be there when you need it.

Here are the steps for manual recovery:

  1. Drop your test table, or find another server for testing.
  2. Create your test table for recovery:
    cat events.sql | clickhouse-client
  3. Copy your backup to the table’s `detached` directory:
    cd /var/lib/clickhouse
    sudo cp -rl backup/my-backup-name/* data/default/events/detached/
  4. Attach the detached parts:
    echo 'alter table events attach partition 202006' | clickhouse-client
  5. Confirm your data has been restored:
    echo 'select count() from events' | clickhouse-client

Automate the Backup Process with clickhouse-backup

The clickhouse-backup tool, created by Alex Akulov, helps to automate the manual steps above: https://github.com/AlexAkulov/clickhouse-backup. We like clickhouse-backup and have implemented several new features, which are described here for the first time.

To get started you’ll need to install clickhouse-backup. Full instructions are in the ReadMe.md file. Here’s an example of installation from a tarball. RPMs, Debian packages, and Docker images are also available.

wget https://github.com/AlexAkulov/clickhouse-backup/releases/download/v0.5.2/clickhouse-backup.tar.gz

tar -xf clickhouse-backup.tar.gz
cd clickhouse-backup/
sudo cp clickhouse-backup /usr/local/bin
clickhouse-backup -v

The API features and new storage options like remote_storage described in this blog article are not yet available in an official build. You either need to build from source or run the latest docker image. Here’s an example of the latter.

docker run --rm -it --network host \
  -v "/var/lib/clickhouse:/var/lib/clickhouse" \
  -e CLICKHOUSE_PASSWORD="password" \
  -e S3_BUCKET="clickhouse-backup" \
  -e S3_ACCESS_KEY="access_key" \
  -e S3_SECRET_KEY="secret" \
  alexakulov/clickhouse-backup:master --help

For the rest of the article we assume you have a build that has the new features. When used on the command line, clickhouse-backup requires a configuration file. Here is a minimal example.

$ cat /etc/clickhouse-backup/config.yml
general:
    remote_storage: none

You will need to add additional configuration options for non-default ClickHouse installations or authentication. A full config example can be created by running clickhouse-backup default-config. This is a great starting point for your use, showing all available settings.

Once configured, clickhouse-backup provides a variety of subcommands for managing backups.

$ clickhouse-backup help
NAME:
   clickhouse-backup - Tool for easy backup of ClickHouse with cloud support
...
COMMANDS:
        tables          Print list of tables
        create          Create new backup
        upload          Upload backup to remote storage
        list            Print list of backups
        download        Download backup from remote storage
        restore         Create schema and restore data from backup
        delete          Delete specific backup
        default-config  Print default config
        freeze          Freeze tables
        clean           Remove data in 'shadow' folder
        server          Run API server
        help, h         Shows a list of commands or help for one command

Just like the manual backup example above, you will need to use sudo or run clickhouse-backup as the clickhouse user.

The configuration file allows for certain databases or tables to be ignored. The tables subcommand will show you which tables will be backed up:

$ clickhouse-backup tables
default.events
system.metric_log   (ignored)
system.query_log    (ignored)
system.query_thread_log (ignored)
system.trace_log    (ignored)

Creating a backup is as easy as:

$ sudo clickhouse-backup create 
2020/07/06 20:13:02 Create backup '2020-07-06T20-13-02' 
2020/07/06 20:13:02 Freeze `default`.`events` 
2020/07/06 20:13:02 Skip `system`.`metric_log` 
2020/07/06 20:13:02 Skip `system`.`query_log` 
2020/07/06 20:13:02 Skip `system`.`query_thread_log` 
2020/07/06 20:13:02 Skip `system`.`trace_log` 
2020/07/06 20:13:02 Copy metadata 
2020/07/06 20:13:02   Done. 
2020/07/06 20:13:02 Move shadow 
2020/07/06 20:13:02   Done.

As you can see in the example above, the backup completed within the same second.

You can review existing local backups:

$ sudo clickhouse-backup list 
Local backups: 
- '2020-07-06T20-13-02' (created at 06-07-2020 20:13:02)

Note that `Size` is not computed for local backups for performance reasons.

Internally, clickhouse-backup utilizes hard links when possible, as described above. The backup is stored in /var/lib/clickhouse/backup/BACKUPNAME. The backup name defaults to a timestamp, but you can optionally specify the backup name with the --name flag. The backup contains two directories: a `metadata` directory, with the DDL SQL statements necessary to recreate the schema, and a `shadow` directory with the data as a result of the ALTER TABLE ... FREEZE operation.

Restoring from a backup is also easy. For example:

$ echo 'drop table events' | clickhouse-client  
$ sudo clickhouse-backup restore 2020-07-06T20-13-02 
2020/07/06 20:14:46 Create table `default`.`events` 
2020/07/06 20:14:46 Prepare data for restoring `default`.`events` 
2020/07/06 20:14:46 ALTER TABLE `default`.`events` ATTACH PART '202006_1_1_4' 
2020/07/06 20:14:46 ALTER TABLE `default`.`events` ATTACH PART '202006_2_2_2' 
2020/07/06 20:14:46 ALTER TABLE `default`.`events` ATTACH PART '202006_3_3_3' 
2020/07/06 20:14:46 ALTER TABLE `default`.`events` ATTACH PART '202006_4_4_3' 
2020/07/06 20:14:46 ALTER TABLE `default`.`events` ATTACH PART '202006_5_5_2' 
2020/07/06 20:14:46 ALTER TABLE `default`.`events` ATTACH PART '202006_6_6_1'

The restore subcommand automates both schema and data restoration. In case you only want to restore the schema, use the optional --schema flag. Or if you only want to restore the data (assuming the schema already exists), you can use the --data flag. The latter case is especially useful in restoring to a server that already has existing data.

Another useful feature is support for specifying a table pattern with most commands, such as create and restore. The --table argument allows you to backup (or restore) a specific table. You can also use a regex to, for example, target a specific database: --table=dbname.*.

Remote Backup Destinations

Of course, you could rsync your backup to a remote destination, save it to an object store like S3, or archive it using an existing backup solution. Local storage is usually insufficient to meet data durability requirements.

The clickhouse-backup tool supports uploading and downloading backups from a remote object store, such as S3, GCS, or IBM’s COS. A minimal AWS S3 configuration looks like:

s3:
    access_key: <YOUR AWS ACCESS KEY>
    secret_key: <YOUR AWS SECRET KEY>
    bucket: <YOUR BUCKET NAME>
    region: us-east-1
    path: "/some/path/in/bucket"

Once you have configured your credentials and destination bucket, clickhouse-backup can take care of the rest:

$ clickhouse-backup upload 2020-07-06T20-13-02 
2020/07/07 15:22:32 Upload backup '2020-07-06T20-13-02' 
2020/07/07 15:22:49   Done.  

The remote backup can be downloaded to local storage before restoration: 
$ sudo clickhouse-backup download 2020-07-06T20-13-02 
2020/07/07 15:27:16   Done.

The clickhouse-backup config file supports backups_to_keep_local and backups_to_keep_remote settings – tune them to meet your data retention requirements. For example, set backups_to_keep_local: 7 and backups_to_keep_remote: 31 to retain a week’s worth of nightly backups locally and a month’s worth remotely. Set both to 0 to disable backup pruning.
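
In config-file form, that retention policy is just two lines in the general section (shown here with S3 as the remote storage):

general:
    remote_storage: s3
    backups_to_keep_local: 7
    backups_to_keep_remote: 31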

There is also a --diff-from option to the upload subcommand. This feature compares files to a previous local backup, only uploading new/changed files. It is essential you retain the previous backup in order to do a restore from the new backup.

Data transfer time and cost are critical aspects of remote storage. How long will it take to restore a large table to a new server? This will be largely dependent on network and storage bandwidth. It’s critical to test various recovery scenarios to have a good understanding of what recovery times you can achieve in a failure. Cost management will be important if you are using a public cloud.

Using the clickhouse-backup API

Finally, clickhouse-backup can run as a service that provides a REST API. This is a new feature. The API mirrors the command-line commands and options, and may be a more convenient way to integrate with an existing scheduling or CI/CD system.

$ sudo clickhouse-backup server & 
$ curl localhost:7171/backup/list 
{"Type":"local","Name":"2020-07-06T20-13-02","Size":0,"Date": "2020-07-06T20:13:02.328066165Z"}

Documentation of the API endpoints can be found here: https://github.com/AlexAkulov/clickhouse-backup#api

Using clickhouse-backup in Production

It is important to take note of the known limitations of clickhouse-backup, which are documented here: https://github.com/AlexAkulov/clickhouse-backup#limitations

In addition, the documentation contains this important warning:

Never change files permissions in /var/lib/clickhouse/backup. This path contains hard links. Permissions on all hard links to the same data on disk are always identical. That means that if you change the permissions/owner/attributes on a hard link in backup path, permissions on files with which ClickHouse works will be changed too. That might lead to data corruption.

Recovery Scenarios

Failed Replica

Failure of individual servers or nodes is by far the most common disaster scenario seen in production. In almost all cases, the failed replica should be replaced and the schema recreated. ClickHouse’s native replication will take over and ensure the replacement server is consistent. This failure scenario is worth testing in advance to understand the network and compute the impact of rebuilding the new replica.

Failed Shard

In a clustered environment, at least one replica from each shard should be backed up. The clickhouse-backup API is one approach for orchestrating backup naming and execution across a cluster.

If all replicas in a shard were to fail, or more commonly, data was corrupted, the entire shard must be restored from a backup as described above. Ideally, restore the backup to one replica, restore the schema to the others, and allow ClickHouse’s native replication take over.

Failed Cluster

A completely failed cluster, whether due to infrastructure failure or data corruption, can be restored in the same manner as a failed shard. One replica in each individual shard must be restored via the process above.

Alternate Backup Strategies

Offline Replica with Filesystem Snapshots

A common alternative is to use an "offline replica" for backups. A replica is configured (often in another region) that is not used for queries as part of any Distributed Tables. The offline replica should not perform any merges, which can be specified with the always_fetch_merged_part and replica_can_become_leader ClickHouse MergeTree settings. While production replicas are best served by the ext4 filesystem, the backup replica uses ZFS (or another filesystem that supports snapshots). This approach provides a quick restoration process. Note that backups, in this case, are still local to the server/node and do not necessarily provide sufficient data durability. ZFS provides directory-based filesystem access to individual snapshots, so it would be possible to automate the storage of these snapshots on a remote system or object store.

Storage-as-a-Service with Snapshots

It is common for Cloud deployments to use network-based block storage (such as AWS EBS or GCP persistent disks). Some on-prem deployments use Ceph or OpenEBS for this purpose. Each of these “storage-as-a-service” technologies supports transparent volume snapshots. By first freezing tables for backup and then creating a snapshot, you can achieve nearly instantaneous backups.

Internally, snapshots only store blocks on disk that have changed since a previous snapshot. While not a true “incremental” backup, these systems provide highly efficient use of disk space. Beware that network-based block storage is rarely as performant as local disk, and be sure to monitor snapshot retention and cost.

Using Kafka to Improve Backups

So far we have discussed specific point-in-time backups that are created on a nightly or hourly basis (for example). Some organizations require the ability to restore to any arbitrary point in time. Because ClickHouse doesn’t have a native binary log (such as the Postgres WAL), some other mechanism is needed to “replay” the data since the last specific point-in-time backup.

Many organizations use Kafka to meet this requirement. Streaming data through Kafka into ClickHouse has many advantages for availability and fault tolerance. Another advantage is the ability to reset ingestion to any offset in the Kafka partition. When performing point-in-time backups, the Kafka offset must be stored. During recovery, the ClickHouse Kafka Engine configuration is set to the Kafka offset at the time the backup was created, and data after that point in time will be ingested.

A simple way to store Kafka offsets is to INSERT them into a table in ClickHouse that is included in the backup. This can be done in a wrapper script that pauses ingest from Kafka, writes the current topic partition offsets, starts backup, and enables ingestion again. When you restore data, you can reset offsets for the consumer group, then re-enable ingest. See the ClickHouse Kafka engine tutorial on this blog for an example of resetting offsets.
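
A sketch of the ClickHouse side of that wrapper script, in the same style as the examples above - the backup_offsets table name and columns are illustrative, not prescribed:

echo "CREATE TABLE IF NOT EXISTS default.backup_offsets (topic String, partition UInt32, offset UInt64, backup_name String, created_at DateTime DEFAULT now()) ENGINE = MergeTree ORDER BY (backup_name, topic, partition)" | clickhouse-client

echo "INSERT INTO default.backup_offsets (topic, partition, offset, backup_name) VALUES ('events', 0, 123456, '2020-07-06T20-13-02')" | clickhouse-client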

Next Steps

Before diving in and implementing a backup solution, take a moment to reflect on the requirements of your business and end users. Get a good understanding of Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) for your environment. Then, consider the approaches outlined above to determine which is the best fit.

One of ClickHouse’s greatest strengths is the support and contributions of a diverse and invested community. The automation provided by clickhouse-backup is no exception: great thanks to Alex Akulov!

At Altinity we care as much about your data as you do. Contact us for help with backups, or any other ClickHouse challenge!

Photo by Simon Migaj on Unsplash.

Updated "firstboot" Release for Raspberry Pi OS

https://www.nedmcclain.com/raspberry-pi-os-firstboot/ · Thu, 27 Aug 2020

I'm pleased to announce an updated build of raspberian-firstboot based on the latest Raspberry Pi OS release (8/20/2020).

I created this solution to scratch a personal itch, so it's exciting to see the community of users and contributors continue to grow. The raspberian-firstboot image is being used to deploy home media centers, industrial process monitors, and a fleet of sensor hubs at a university lab.

nmcclain/raspberian-firstboot: A lightly modified Raspbian-lite image supporting first boot customization
https://github.com/nmcclain/raspberian-firstboot

It's a simple, open source improvement on the base Raspberry Pi OS image:

The standard Raspbian-lite image allows you to customize the wireless settings and enable SSHd before flashing it to an SD card. Unfortunately, there is no way to further customize the OS during the first boot, nothing like cloud-init or userdata. Without a display and keyboard, complex "headless" deployments are impossible.

Important Changes

Since the 2020-02-13 raspberian-firstboot release, there have only been a few notable changes to the lite OS:

  • The OS name changed from Raspbian to Raspberry Pi OS for political reasons. This is a good example of resolving an open source branding issue, and an excellent example to the larger community.
  • Internal audio outputs enabled as separate ALSA devices
  • Updated udev to add pwm to gpio group
  • i2cprobe: More flexible I2C/SPI alias mapping
  • Disk ID is now regenerated on first boot
  • More raspi-config control over boot device and EEPROM version
  • Updated Raspberry Pi firmware
  • Linux kernel 4.19.97 -> Linux kernel 5.4.51

Note that all of these improvements are thanks to the official Raspberry Pi OS release. The raspberian-firstboot image sprinkles a little sugar on top: a simple systemd script to run firstboot.sh on the initial startup.
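
For a feel of what that enables, a firstboot.sh might look like the sketch below. The hostname and packages are placeholders; see the project README for where the script goes and what it can assume:

#!/bin/bash
# Illustrative firstboot.sh: runs once, on first boot, via the image's systemd unit.
set -e
hostnamectl set-hostname sensor-hub-01       # placeholder hostname
apt-get update && apt-get install -y git python3-pip
systemctl enable --now ssh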

Enterprise?

Updated "firstboot" Release for Raspberry Pi OS

This "firstboot" solution provides convenient dynamic fleet deployment for those comfortable with Linux who can customize their images. If you are looking for a turnkey, robust, enterprise-friendly fleet management platform, check out Balena.io. Balena offers custom images, web and API-based centralized management, and remote console access to your devices. I have no affiliation - just a fan of their technology.

Looking Forward

In the long term, I'll be excited to see this open source project die. In an ideal world, Raspberry Pi OS would provide a native solution for flexible on-deployment configuration of new nodes. I suspect they could find a more robust solution than the firstboot.sh bash script.

In the short term, I'd love to hear from more community members! I'm happy to tackle bug fixes and new features, but I especially enjoy hearing how you are operating your fleet of edge devices. Please let me know if I can help!

I'm focused where software, infrastructure, and data meets security, operations, and performance. Follow or DM me on twitter at @nedmcclain.

Raspberry Pi is a trademark of the Raspberry Pi Foundation, with which I am in no way affiliated.

Better CSRF Protection

https://www.nedmcclain.com/better-csrf-protection/ · Wed, 25 Mar 2020

"The secret to strong security: less reliance on secrets." – Whitfield Diffie

Web developers must protect their apps against Cross-Site Request Forgery (CSRF) attacks. If they don't, a hacker controlling some other web site could trick the app into taking action on behalf of an innocent user. Typically, web servers and browsers exchange secret CSRF tokens as their primary defense against these exploits.

CSRF tokens are traditionally stored either by the web server or in an encrypted cookie in the web browser. This article looks at an alternative approach to CSRF protection: on-demand, cryptographically signed tokens that require no storage.

Quick History of CSRF Protection

In all cases, the app server generates a token, stores it somewhere, and sends it to the browser. The browser includes this token in mutable requests, and the app checks to make sure it is valid.

The first generation of CSRF protection just stored tokens on a local disk. Eventually, apps had to scale beyond a single server and used a shared database to store the CSRF token. This approach is effective, but it adds unnecessary load on the database server.

Today's web frameworks use encrypted cookies to store the token in the browser. When making a request, the browser includes both the raw CSRF token and the encrypted cookie (containing the token). The server can decrypt the cookie and verify that the two tokens match. This cookie-to-header token scheme is secure because the browser has no way to decrypt the contents of the cookie.

This partial HTTP response shows the encrypted cookie _gorilla_csrf containing the CSRF token, as well as the X-Csrf-Token header with the raw token:

< HTTP/1.1 200 OK
< Set-Cookie: _gorilla_csrf=MTU4Mzg3Nzg4M3xJa2hhVUVSUU9WaFphSEJhV1RCSlNIQjJTbWhtU1cxWVMyeDZNM1JyWW1kSFFXVk5SVzlLZHk4dlJITTlJZ289fJoHPWZobIWPupouH6OGXF311g4FZO6wk5VVdWqFuKah; Expires=Wed, 11 Mar 2020 10:04:43 GMT; Max-Age=43200; HttpOnly; Secure; SameSite
< Vary: Cookie
< X-Csrf-Token: FYPX7DS5pqZ2AnuHC6PiAnDrcjEz/WQ8PXdivxLqjf4IEBTT4WEgMC7S+m63O70gFSHlDN5s3Do8lGYfjtVxxQ==

The double-submit-cookie scheme is similar, except the browser sends the raw CSRF token in a separate cookie. It is used by platforms like Django and Rails.

The conventional cookie-to-header and double-submit-cookie approaches are analogous to having the server store the token in the database. Instead of looking up the token in a database, the server "looks up" the token by decrypting the cookie sent from the browser. Then, in both cases, the server compares that token to the one sent with the browser's request.

Stateless CSRF Protection

Moving the token storage from server to client was a great innovation. The server no longer has to worry about it, and encryption keeps the token secure while the browser stores it. We have moved the "state" from your server to the user's browser.

Can we do better than this? Consider the benefits of a completely "stateless" solution:

  • Works on browsers that have cookies disabled
  • Supports WebSockets, which don't work with cookies
  • Reduces bandwidth, since the encrypted CSRF cookie doesn't get shipped with each request
  • Doesn't require a database or server-side storage

HMAC Based CSRF Tokens

All of the anti-CSRF techniques described above involve the app server generating a token, which is encrypted so the browser can't read it. An often-overlooked alternative is to have the server create a cryptographically signed token.

The token doesn't have to be hidden from the client, because it can only be created (signed) with a secret key on the server. Only the server can cryptographically validate the token so that an attacker cannot forge or tamper with it. Thanks to this validation, the server doesn't need to store the token anywhere.

When the browser makes a request, it includes the CSRF token just like usual. The server doesn't have to look anything up; it can use its secret key to confirm the token is valid. HMAC is used to create the signed token, so this architecture is officially dubbed the HMAC Based Token Pattern.

[Diagram: sequence diagrams for the cookie-to-header and HMAC CSRF token schemes]

This technique is lightweight and flexible by nature. Modern JavaScript applications can store the token wherever they please: application state (Redux/Vuex/etc.), browser localstorage, or even a cookie.

When making requests, the browser includes the CSRF token in a header field. This matches the other schemes described above; the only difference is that no encrypted CSRF cookie is required.

If you combine stateless CSRF with token-based sessions, your app might not even need cookies at all. Get your analytics privacy right, and you could say goodbye to Cookie Law popups.

In all, a win for performance, compatibility, and convenience.

Better CSRF Protection for Developers

🚨 If your web framework already provides CSRF protection, by all means use that! Security is generally a bad place to spend your innovation tokens.

The CSRF token is created by applying an HMAC-SHA256 hash to a timestamp and the session ID, then Base64 encoding the result:

func generateCSRFHash(ts, sessionid, key string) (string, error) {
    if len(key) < MIN_KEY_LENGTH {
        return "", fmt.Errorf("Key too short")
    }
    body := []byte(ts)
    body = append(body, []byte(sessionid)...)
    mac := hmac.New(sha256.New, []byte(key))
    mac.Write(body)
    return base64.StdEncoding.EncodeToString(mac.Sum(nil)), nil
}
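
The full token is simply the timestamp concatenated with that hash - the validation code below slices it apart at TSLen. A generation helper is not shown in the original, so treat this one as a sketch:

// Sketch: issue a token as a 19-digit UnixNano timestamp followed by the base64 HMAC.
func GenerateCSRFToken(sessionid, key string) (string, error) {
    ts := strconv.FormatInt(time.Now().UnixNano(), 10) // 19 digits, matching TSLen
    hash, err := generateCSRFHash(ts, sessionid, key)
    if err != nil {
        return "", err
    }
    return ts + hash, nil
}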

The server sends the CSRF token to the client, usually in a header. Later, when the client sends the token back, the server can validate it without cookies or server-side storage.

const TSLen = 19

func ValidateCSRFToken(uid, key, token string, valid time.Duration) error {
    if len(token) < TSLen+1 {
        return fmt.Errorf("CSRF token too short")
    }
    ts := token[:TSLen]
    hash := token[TSLen:]

    expectedHash, err := generateCSRFHash(ts, uid, key)
    if err != nil {
        return fmt.Errorf("CSRF hash error")
    }
    if !hmac.Equal([]byte(hash), []byte(expectedHash)) {
        return fmt.Errorf("CSRF token invalid")
    }
    issuedTs, err := strconv.ParseInt(ts, 10, 64)
    if err != nil {
        return fmt.Errorf("CSRF timestamp invalid")
    }
    issued := time.Unix(0, issuedTs)
    if issued.Add(valid).Before(time.Now()) {
        return fmt.Errorf("CSRF token expired")
    }
    return nil
}

There are existing CSRF HMAC implementations for NodeJS and Java, but I haven't tried them. It feels like there is an opportunity to add CSRF HMAC support to many other web frameworks.

No matter what approach you use for CSRF protection, it is worthless without proper configuration. Use HTTPS and Origin checks, and check referer headers. Ensure that your CORS policy is strict. Yeah, a * wildcard doesn't count.

If you're using cookies, consider SameSite, Secure, and HttpOnly options, as well as a Vary: Cookie header to avoid caching. And never roll your own encryption.

All anti-CSRF schemes have one other fatal weakness: it's game over if an attacker can run malicious JavaScript in your client's browser.

Every web and API developer should be familiar with Cross-Site Scripting (XSS), one of the most common vulnerabilities on the internet. Review your XSS controls and make sure they meet the current recommendations.

Aloha

Find this approach compelling? Spy a glaring security vulnerability? I'm focused where software, infrastructure, and data meets security, operations, and performance. Follow or DM me on twitter at @nedmcclain.

Why DevOps ❤️ ClickHouse

https://www.nedmcclain.com/why-devops-love-clickhouse/ · Thu, 27 Feb 2020

"Humans' ability to adapt to technological change is increasing, but it is not keeping pace with the speed of scientific & technological innovation. To overcome the resulting friction, humans can adapt by developing skills that enable faster learning & quicker iteration & experimentation." – Astro Teller

Databases hold a special place in the hearts of operations folks. Usually, not a place of love. Stateful services are challenging to operate. Data is "heavy." You only get two of the following three essential guarantees: consistency, availability, and partition tolerance.

Databases take 10x or more effort to operate in production than stateless services.

Analytics databases have even more unique operational challenges. "Data lakes" are typically massive in scale and demand "interactive" query performance. The queries they service are fundamentally different from what a relational database sees. Analytics is all about fast queries against big data that grows continually but never changes.

ClickHouse/ClickHouse: ClickHouse is a free analytics DBMS for big data
https://github.com/ClickHouse/ClickHouse

I've spent the past couple years working with ClickHouse, and it is my number one tool for solving large-scale analytics problems. ClickHouse is blazing fast, linearly scalable, hardware efficient, highly reliable, and fun to operate in production. Read on for a rundown of ClickHouse's strengths and recommendations for production use.


Rule of DevOps: Default to Postgres or MySQL

I must repeat a cardinal rule of DevOps before extolling the virtues of ClickHouse: Default to Postgres (or MySQL). These databases are proven, easy to use, well understood, and can satisfy 90% of real-world use cases. Only consider alternatives if you're pretty sure you fall into the 10%.

Big data analytics is one use case that falls into that rare 10%. ClickHouse might be a good fit for you if you desire:

  • Rich SQL support, so you don't have to learn/teach a new query language.
  • Lightning-fast query performance against essentially immutable data.
  • An easy-to-deploy, multi-master, replicated solution.
  • The ability to scale horizontally across data centers via sharding.
  • Familiar day-to-day operations and DevOps UX with good observability.

What Makes ClickHouse Different


ClickHouse is a column-store database, optimized for fast queries. And it is fast. The ClickHouse team boldly (and accurately) claims:

"ClickHouse works 100-1,000x faster than traditional approaches."

Traditional databases write rows of data to the disk, while column-store databases write columns of data separately. For analytics queries, the column-store approach has a few key performance advantages:

  1. Reduced disk IO: Analytics queries often focus on a handful of columns. Traditional databases must read each row, with all its columns, off the disk. A column-store database reads only the relevant columns from disk. With high-cardinality analytics databases, disk IO is reduced by 100x or more. In general, disk IO is the primary performance bottleneck for analytics, so the benefit here is tremendous.
  2. Compression: By nature, each column stores very similar data. Imagine a food preference field allowing any, vegan, and vegetarian. Compressing a file with three repeating values is incredibly efficient. We get about 20x compression in a large production ClickHouse environment I help manage. Highly compressed data means even less disk IO.
  3. Data locality: The image below shows traditional row storage on the left and column storage on the right. The red blocks represent data from a single column. With a high-cardinality analytics query, reads are mostly contiguous instead of spread across the disk. These queries work hand-in-hand with the OS's read-ahead cache.
Why DevOps ❤️ ClickHouse
Traditional row storage vs. column-store https://clickhouse.yandex/blog/en/evolution-of-data-structures-in-yandex-metrica

Let's look at a simple dining_activity database table. We'll use it to track who's eating the most food the fastest, and maybe encourage them to go for a jog.

CREATE TABLE dining_activity (username text, food_eaten text, ounces int, speed int, dined_at date) ...
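For reference, here is a sketch of what a complete ClickHouse definition of this table might look like. The column types, MergeTree engine, and partitioning/ordering keys below are illustrative assumptions, not part of the original example:

# Illustrative only: a MergeTree table partitioned by month and ordered by date and user.
clickhouse-client --query "
  CREATE TABLE dining_activity (
      username   String,
      food_eaten String,
      ounces     UInt16,
      speed      UInt16,
      dined_at   Date
  ) ENGINE = MergeTree()
  PARTITION BY toYYYYMM(dined_at)
  ORDER BY (dined_at, username)"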

On the left, you can see that MySQL stores the entire table's worth of data in a single file on disk. ClickHouse, on the right, stores each table column in a separate file. This is the essence of a column-store database.

Why DevOps ❤️ ClickHouse
Our table stored on disk: MySQL InnoDB vs. ClickHouse column-store.

ClickHouse is purpose-built for Online Analytical Processing (OLAP), not Online Transaction Processing (OLTP). It is designed for analysis of immutable data, such as logs, events, and metrics. New data arrives all the time, but it is not changed.

⚠️ Do not use ClickHouse where you need frequent UPDATEs or DELETEs. ClickHouse provides fast writes and reads at the cost of slow updates.

NOTE: ClickHouse does support UPDATE and DELETE queries, but only with eventual consistency. These operations are slow due to ClickHouse's MergeTree implementation. This feature is useful for GDPR compliance, for example, but a transactional app simply won't work well on ClickHouse.
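For the curious, these eventually consistent deletes are issued as "mutations" rather than classic DML; a GDPR-style erasure might look something like this sketch (the table and column names are hypothetical):

# Mutations rewrite data parts in the background; don't rely on them for transactional workloads.
clickhouse-client --query "ALTER TABLE user_events DELETE WHERE user_id = 12345"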

Ultimately, the column-store architecture is faster and more cost-effective for analytics use cases. The CloudFlare team shares their success with ClickHouse in the article below.

HTTP Analytics for 6M requests per second using ClickHouse
One of our large scale data infrastructure challenges here at Cloudflare is around providing HTTP traffic analytics to our customers. HTTP Analytics is available to all our customers via two options:
Why DevOps ❤️ ClickHouse

A Note About RedShift

AWS RedShift is an excellent hosted solution; I have used it successfully for several projects. As a traditional RDBMS, it supports transactional and analytics applications. If you need a mostly Postgres-compatible analytics database and cost is no issue, definitely consider RedShift.

On the other hand, this Altinity benchmark helps to demonstrate ClickHouse's impressive performance/cost ratio. The server supporting the green ClickHouse bars costs about $190 per month. The servers supporting the red and orange RedShift bars below cost about $650 and $4,100 per month, respectively. Compared to RedShift on a ds2.xlarge server, ClickHouse offers ~10x better performance at roughly 30% of the cost.

Why DevOps ❤️ ClickHouse
https://www.altinity.com/blog/2017/7/3/clickhouse-vs-redshift-2

Key use cases where ClickHouse may be a better fit than a Postgres cluster like RedShift include:

  • Non-transactional use: Based on Postgres, RedShift can support both analytics and transactional use cases. ClickHouse does not support the latter, so it's only appropriate for analytics use cases.
  • Geographic sharding: RedShift is designed to be deployed in a single AWS Region. Compliance requirements or the performance desire to serve content close to your users may drive you toward ClickHouse.
  • Capital matters: You will spend more on both compute and storage with a RedShift cluster for equivalent analytics performance.

What about GCP's BigQuery? BigQuery is an outstanding solution for non-interactive analytics. However, I have not seen evidence that it performs fast enough for "interactive" analytics use. It is typical for BigQuery to take more than a few seconds to return results.


Why DevOps ❤️ ClickHouse

ClickHouse is close to perfect for both data scientists and operations people. It is easy to use, manage, observe, scale, and secure. With SQL and ODBC support, it's an equally powerful analytics back end for both developers and non-technical users.

This section explores ClickHouse's unique characteristics and includes recommendations for using ClickHouse in production. Already convinced that ClickHouse might be useful? Please bookmark this for when you are ready to move to production!

Usability

The practice of DevOps values business impact and ultimately, the end user, above all else. We want to provide services that are easy to digest.

It's unreasonable to ask your analysts and data scientists to learn a new query language. They expect and deserve SQL. Their tools and training are going to fall over if you build a MongoDB analytics solution, and they have to start doing this:

db.inventory.find( { $or: [ { status: "A" }, { qty: { $lt: 30 } } ] } )

ClickHouse works well with graphical data analysis tools because it supports a fairly standard dialect of SQL. It also includes a bunch of non-standard SQL functions specifically designed to help with analytics.

Support for ODBC means ClickHouse works great with familiar tools such as Tableau, Power BI, and even Excel.

ClickHouse/clickhouse-odbc
ODBC driver for ClickHouse. Contribute to ClickHouse/clickhouse-odbc development by creating an account on GitHub.
Why DevOps ❤️ ClickHouse

For even less technical users, start them off on the right foot with a web-based analysis tool:

  • Grafana can speak to ClickHouse and it produces visually stunning dashboards.
  • The Tabix web UI was built explicitly for ClickHouse. In addition to data exploration, query building, and graphing, Tabix has tools for managing ClickHouse itself.
  • Redash sports similar features to Tabix but with richer functionality. It can pull data from a variety of sources besides ClickHouse, merging results into reports and visualizations.
  • ClickHouse empowers many other commercial and open source GUI tools.

To the developer audience, ClickHouse has libraries for your favorite language, including Python, PHP, NodeJS, and Ruby, among others. As a picky Go developer, it has been a pleasure using the clickhouse-go library to talk to ClickHouse.

Want to use ClickHouse but don't want to update your code? Java developers will appreciate native JDBC support. Others will find the native MySQL interface works with their language's standard MySQL library. It speaks the MySQL wire protocol and is mighty fast.

Data scientists will be right at home with ClickHouse's support for Python Jupyter Notebooks. Jupyter is the standard workflow tool for statistical analysis and machine learning. Accessing ClickHouse data from Jupyter is a breeze with the clickhouse-sqlalchemy Python library.

Operators will love the fact that ClickHouse has a standard CLI clickhouse-client for interacting with the database. It works a lot like MySQL's mysql and Postgres' psql commands. It feels familiar.

And oh yeah, ClickHouse has some pretty great documentation.

Finally, ClickHouse exposes an HTTP interface. It makes doing a server health check as simple as:

$ curl http://localhost:8123
Ok.

You can even run queries with curl:

$ curl 'http://localhost:8123/?query=SELECT%20NOW()'
2020-02-15 20:10:21

Data Management

The libraries mentioned above allow you to integrate ClickHouse with your custom software. This affords infinite flexibility but is certainly not required to get data into ClickHouse for analysis.

The clickhouse-client CLI has a simple way to load bulk data from files. It supports basic formats like CSV, TSV, and JSON. It also supports many modern data formats such as Apache Parquet, Apache Avro, and Google's Protobuf. Using one of these newer formats offers enormous advantages in performance and disk usage.

Of course, you can also use the clickhouse-client CLI to export data in each of these formats.
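As a quick sketch (the file and table names here are made up), loading and exporting data looks like this:

# Load a CSV file into an existing table:
cat dining_activity.csv | clickhouse-client --query "INSERT INTO dining_activity FORMAT CSV"
# Export the same table as Parquet:
clickhouse-client --query "SELECT * FROM dining_activity FORMAT Parquet" > dining_activity.parquet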

We looked at ClickHouse's ability to act as an ODBC server above. ClickHouse can also act as an ODBC client. ClickHouse will ingest data from an external ODBC database and perform any necessary ETL on the fly. It can do the same for external MySQL databases and JDBC database endpoints. ClickHouse also has a SQL abstraction over existing HDFS volumes, making Hadoop integration easy.

There are several off-the-shelf solutions for streaming data into ClickHouse. With built-in support for Apache Kafka, ClickHouse can publish or subscribe to Kafka streams. This fits nicely in existing Kafka streaming architectures and works well with AWS MSK. A third-party library exists to stream data from Apache Flink into ClickHouse. Have an existing ClickHouse deployment? The clickhouse-copier tool makes it easy to sync data between ClickHouse clusters.

Let's face it: System and application logs are a prime candidate for ClickHouse data. An agent can forward those logs to your database. One option is clicktail, which was ported from Honeycomb.io's honeytail to add ClickHouse support. Another old favorite is LogStash. You'll need to install the logstash-output-clickhouse plugin, but then you'll be able to write logs directly to ClickHouse.

Backups, restore, and disaster recovery are thankless tasks for operators. Analytics databases are particularly tricky because they are too big to just dump to a single backup file. It helps to have tools that are easy to use and scale well.

ClickHouse provides just the right tools for this need. Table data gets split up on disk into "parts," each managed independently. It's easy to "freeze" a part, and save the "frozen" files using rsync, filesystem snapshots, or your favorite backup tool. You can specify how parts are created based on your use case: Your parts could hold a week's worth of data, or an hour's worth. This abstraction makes managing Petabyte-scale data sets possible.
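A minimal backup sketch, assuming a default installation layout under /var/lib/clickhouse and a backup host of your choosing:

# FREEZE hard-links the table's parts into the shadow/ directory without blocking writes:
clickhouse-client --query "ALTER TABLE dining_activity FREEZE"
# Then ship the snapshot off-host with your favorite tool:
rsync -a /var/lib/clickhouse/shadow/ backup-host:/backups/clickhouse/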

There are also some friendly safeguards built into ClickHouse, focused on preventing human error. You must jump through an extra hoop, for example, when dropping tables over 50 GB. I embrace this feature with open arms.

Deployment

ClickHouse scales handily, from your laptop to a single server to a globally distributed cluster. Get familiar with the server, client, and configuration on your workstation. All of those skills will transfer to production.

ClickHouse is written in C++ and supports Linux, MacOS, and FreeBSD. DEB and RPM packages make installation easy. You should be aware of Altinity's Stable Releases. ClickHouse is moving fast, sometimes breaking things, and I've encountered some serious problems using the latest "release" from GitHub. Be sure to use one of Altinity's recommended releases for production use.

The ClickHouse docker container is easy to use and well maintained. It's a great way to experiment locally, or deploy into your containerized environment. At Kubernetes shops, clickhouse-operator will get you up and running in production in no time. It will manage your K8s storage, pods, and ingress/egress configuration. Just run:

kubectl apply -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/manifests/operator/clickhouse-operator-install.yaml

Ansible admins can get a head start with this ansible-clickhouse role.

Finally, the corporate sponsor of ClickHouse offers a managed ClickHouse in the cloud. I note it here to be complete but have not used it personally. I suspect we won't see an AWS-managed ClickHouse anytime soon. It could undercut RedShift revenue!

Availability

This section is short. High availability is easier with ClickHouse than any other database I've ever used. Except for DNS, of course :)

Replication guarantees that data gets stored on two (or more) servers. ClickHouse's painless cluster setup is one of my favorite features as an operations person. Replication is trivial to configure: You need a Zookeeper cluster and a simple config file.
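Once ZooKeeper is in place, a replicated table is mostly a matter of choosing the ReplicatedMergeTree engine. Here's a sketch, re-using the earlier example table; the ZooKeeper path and the {shard}/{replica} macros are illustrative and come from each server's config:

# {shard} and {replica} are substitution macros defined per server:
clickhouse-client --query "
  CREATE TABLE dining_activity (
      username String, food_eaten String, ounces UInt16, speed UInt16, dined_at Date
  ) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/dining_activity', '{replica}')
  PARTITION BY toYYYYMM(dined_at)
  ORDER BY (dined_at, username)"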

If you've ever set up a Postgres/MySQL cluster, you know it's no fun. Configuring replication involves a careful sequence of manual steps: locking the tables (impacting production), quiescing the database, noting the binlog location, snapshotting the volume, unlocking the tables, rsync'ing the binlogs to the follower, loading the binlogs on the follower, setting the binlog location on the follower, starting replication, etc... With ClickHouse, this is almost entirely automatic.

Instead of the standard master/follower model used by Postgres and MySQL, ClickHouse's replication is multi-master by default. You can insert new data into any replica, and can similarly query data against any replica. The multi-master architecture saves the need to front your database with load balancers. It also eases client configuration and saves your developers time and effort.

The Zookeeper dependency mentioned above is the only real cost of using ClickHouse's replication. That's three more servers (or pods) you must deploy, patch, and observe. Zookeeper is pretty low maintenance if it's set up correctly. As the documentation says, "With the default settings, ZooKeeper is a time bomb." Looking for a slightly more modern alternative? Check out zetcd from CoreOS.

Performance and Scalability

Play with ClickHouse for a few minutes and you'll be shocked at how fast it is. An SSD-backed server can query several GB/sec or more, obviously depending on many variables. For example, you might see queries achieve 20 GB/sec, assuming 10x compression on a server with SSD that can read 2 GB/sec.

Detailed ClickHouse performance benchmarks are provided here.

Single-server performance is fun to explore, but it's fairly meaningless at scale. The service must be able to scale "horizontally," across dozens or hundreds of individual servers.

It uses a cluster of 374 servers, which store over 20.3 trillion rows in the database. The volume of compressed data, without counting duplication and replication, is about 2 PB. The volume of uncompressed data (in TSV format) would be approximately 17 PB. - ClickHouse use at Yandex

Yandex operates ClickHouse at this scale while maintaining interactive performance for analytics queries. Like other modern databases, ClickHouse achieves massive horizontal scaling primarily through sharding.

With sharding, servers store only a subset of the database table. Given two servers, one might store records starting with A-M while the other stores records for N-Z. In practice, ClickHouse shards rows using a sharding expression (often a hash of a key column), while data within each shard is typically partitioned by a date/time field.

With ClickHouse, database clients don't need to know which servers will store which shards. ClickHouse's Distributed Tables make this easy on the user. Once the Distributed Table is set up, clients can insert and query against any cluster server. For inserts, ClickHouse will determine which shard the data belongs in and copy the data to the appropriate server. Queries get distributed to all shards, and then the results are merged and returned to the client.
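A sketch of the client-facing Distributed table (the cluster and database names here are assumptions):

# Queries and inserts go to dining_activity_all; ClickHouse routes them to the right shard:
clickhouse-client --query "
  CREATE TABLE dining_activity_all AS dining_activity
  ENGINE = Distributed(my_cluster, default, dining_activity, rand())"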

Why DevOps ❤️ ClickHouse
Via: https://www.altinity.com/blog/2018/5/10/circular-replication-cluster-topology-in-clickhouse

Sharding allows for massive horizontal scaling: A sharded table can be scaled out to hundreds of nodes, each storing a small fraction of the database.

Here are a few more performance tips:

  • Materialized views. Instruct ClickHouse to perform aggregation and roll-up calculations on the fly. Materialized views are the secret to making dashboards super-fast (see the sketch after this list).
  • User-level quotas. Protect overall performance by limiting user impact based on # of queries, errors, execution time, and more.
  • TTLs for data pruning. It's easy to tell ClickHouse to remove data after some lifetime. You can set Time-To-Live for both tables and individual columns. Meet your compliance requirements and avoid DBA effort with this feature.
  • TTLs for tiered data storage. ClickHouse's TTL feature recently added support for moving data between different tiers of storage. For example, recent "hot" data could go on SSD for fast retrieval, with older "colder" data stored on spinning magnetic disks.
  • Performance troubleshooting with clickhouse-benchmark. You can use this tool to profile queries. For each query, you can see queries/sec, #rows/sec, and much more.
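Here is the sketch promised above: a materialized view for a dashboard roll-up, plus a table TTL. The names and intervals are illustrative, not a prescription:

# Pre-aggregate ounces per user per day; dashboards query the view instead of raw rows:
clickhouse-client --query "
  CREATE MATERIALIZED VIEW dining_daily
  ENGINE = SummingMergeTree() ORDER BY (dined_at, username)
  AS SELECT dined_at, username, sum(ounces) AS total_ounces
  FROM dining_activity GROUP BY dined_at, username"
# Expire raw rows after two years:
clickhouse-client --query "ALTER TABLE dining_activity MODIFY TTL dined_at + INTERVAL 2 YEAR"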

⚠️ Don't even think about running ClickHouse in production without reviewing the Usage Recommendations and Requirements pages. They contain essential advice on hardware, kernel settings, storage, filesystems, and more. Use ext4, not zfs.

The standard ClickHouse cluster deployment (replication, sharding, and Distributed Tables) is very effective. It's probably enough for most use cases. For those who desire additional control, CHProxy is gold.

Similar to an HTTP reverse proxy, CHProxy sits in between the ClickHouse cluster and the clients. CHProxy gives you an extra layer of abstraction:

  • Proxies to multiple distinct ClickHouse clusters, depending on the input user.
  • Evenly spreads requests among replicas and nodes.
  • Caches responses on a per-user basis.
  • Monitors node health and prevents sending requests to unhealthy nodes.
  • Supports automatic HTTPS certificate issuing and renewal via Let’s Encrypt.
  • Exposes various useful metrics in Prometheus format.

Importantly, CHProxy can route clients directly to the appropriate shard. This eliminates the need to use Distributed Tables on INSERT. ClickHouse is no longer responsible for copying data to the appropriate shard, lowering CPU and network requirements.

Why DevOps ❤️ ClickHouse

Use CHProxy to provide universal ingestion endpoints that ship events to wholly distinct ClickHouse clusters. Give certain clients a high-priority cluster. Send client data to a "close" network location, or a compliant geography. Or shunt misbehaving clients to a dedicated cluster (or /dev/null).

Observability

ClickHouse is easy to observe: Metrics are readily available, event logging is rich, and tracing is reasonable.

Prometheus is today's standard cloud-native metrics and monitoring tool. The clickhouse_exporter sidecar works great in both containerized and systemd environments.

f1yegor/clickhouse_exporter
This exporter is now maintained in the Percona-Lab fork https://github.com/percona-lab/clickhouse_exporter. This is a simple server that periodically scrapes ClickHouse(https://clickhouse.yandex/…
Why DevOps ❤️ ClickHouse

Prometheus and CHProxy's integration with Grafana means you get beautiful dashboards like this:

Why DevOps ❤️ ClickHouse
https://grafana.com/grafana/dashboards/869

ClickHouse supports shipping metrics directly to Graphite. There are also Nagios/Icinga and Zabbix checks for ClickHouse; you don't have to embrace Prometheus.

As mentioned above, a simple GET request to the ClickHouse HTTP endpoint serves as a healthcheck. To monitor replicas, you can set the max_replica_delay_for_distributed_queries parameter and use each server's /replicas_status endpoint. You'll get an error response with details if the replica isn't up to date.
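For example (the healthy-case output shown is what I'd expect from an up-to-date replica; check your own deployment):

$ curl http://localhost:8123/replicas_status
Ok.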

Logging with ClickHouse is good, but not ideal. Verbose logs are provided and contain useful stack dumps. Unfortunately, they are multi-line messages, requiring special processing. JSON-formatted event logs are not an option.

ClickHouse tracing data is stored in a special system table. This approach is far from optimal for your standard log ingestion tool, but it works for troubleshooting specific issues. There has been some discussion about adding OpenTracing support.

Despite the promises of DevOps, production operators still regularly need to troubleshoot individual services. Similar to Postgres, you can introspect ClickHouse using SQL. Internals such as memory utilization, replication status, and query performance are available.
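A couple of introspection queries, as a sketch of what's available in the system tables:

# Currently running queries and their memory usage:
clickhouse-client --query "SELECT query, elapsed, memory_usage FROM system.processes"
# Replication health per table:
clickhouse-client --query "SELECT database, table, is_readonly, absolute_delay FROM system.replicas"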

To bookend this section, it's worth noting that ClickHouse also makes a robust back end for storing large-scale Graphite metrics.

Security

Infrastructure folks know HTTP. Securing it is second nature. We have a boatload of proxies, Layer 7 firewalls, caching services, and networking tools to leverage against the HTTP protocol. ClickHouse and CHProxy provide a standard HTTPS endpoint that requires no new infrastructure tooling.

ClickHouse expects you to configure individual users. You could use root for all access, but individual users are supported for those who care about security. You can grant each user access to a list of databases, but there are no table-level controls. Read-only permissions are available, but they apply to all databases the user can access. It's impossible to grant a user read-write access to one database and read-only access to another.

The limitations above may be problematic for some. If you need further control over untrusted users, CHProxy provides it. CHProxy's separate abstraction of users brings its own security and performance controls. Hide ClickHouse database users behind public-facing CHProxy users. Limit per-user query duration, request rate, and # of concurrent requests. CHProxy will even delay requests until they fit in per-user limits, which is perfect for supporting dense analytics dashboards.


ClickHouse for SysAdmins

Sysadmins and DevOps folks will discover one other compelling ClickHouse use case: clickhouse-local as a command-line Swiss army knife.

The clickhouse-local program enables you to perform fast processing on local files, without having to deploy and configure the ClickHouse server.

I have used tools like sed, awk, and grep on the command line for several decades. They are simple, powerful tools that can be chained together to solve novel problems. I'll always keep them in my pocket.

Still, these days I often turn to clickhouse-local when scripting or writing bash aliases. It can take tabular output from tools like netstat or ps and run SQL queries against it. You can sort, summarize, and aggregate data in ways that are impossible with traditional tools. Like sed and friends, clickhouse-local follows the Unix philosophy.

Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface. – Doug McIlroy, Bell Labs

Add the following to your .bash_profile, and you'll be able to use netop10 to see the remote hosts with the most connections.

alias netop10=NetstatTopConn
function NetstatTopConn()
{
        # Reshape netstat output with awk, then let clickhouse-local count and rank the address column with SQL.
        netstat -an | egrep '^..p' | awk '{print $1 "\t" $4 "\t" $6}' | clickhouse-local -S 'proto String,raddr String,state String' -q 'select count(*) as c,raddr from table group by raddr order by c desc limit 10 FORMAT Pretty'
}

Sample clickhouse-local alias entry in .bash_profile

There are plenty of other great usage examples in the clickhouse-local docs.

clickhouse-local is also popular with data science folks. It's a blazingly fast way to do quick ad-hoc analysis, and it supports all the file formats discussed earlier. Take a huge CSV or Parquet file and run SQL queries directly against it with clickhouse-local. No server required!


Next Steps

Whether you are a data scientist working on your laptop, a developer, or an operations engineer, ClickHouse deserves a place in your toolbelt. It is lightning fast for analytics use cases, efficient with compute and storage resources, and a pleasure to operate in production.

“The key to good decision making is not knowledge. It is understanding. We are swimming in the former. We are desperately lacking in the latter.” ― Malcolm Gladwell

Want to learn more? Check out the ClickHouse QuickStart. The Altinity blog is also a fountain of ClickHouse knowledge.

Are you tackling a data analytics challenge? I'm focused where software, infrastructure, and data meets security, operations, and performance. Follow or DM me on twitter at @nedmcclain.

]]>
<![CDATA[Pandemic Preparation for IT Teams]]>https://www.nedmcclain.com/pandemic-preparation-it-teams/65caf591b81cfc00016649a4Wed, 26 Feb 2020 15:49:36 GMT“Remember that you are a Black Swan.” ― Nassim Nicholas Taleb
Pandemic Preparation for IT Teams
Pandemic Preparation for IT Teams

I spent much of 1999 preparing for the Y2K bug. When the clock struck midnight, nothing broke. The business felt it received very little return for our efforts.

In 2000 and 2001, I helped Empire Blue Cross with IT disaster planning. The 9/11 attack on the office at WTC 2 tested our plans, and afterward I found disaster planning budgets suddenly plentiful.

Some might consider these experiences failures. They are wrong. Highly-visible threats get people out of their comfort zone and serve to motivate investment in preparation for a "worst-case" scenario.

The COVID19 outbreak has reached near-pandemic levels. Now is not the time for alarm. However, it is the perfect opportunity to take a moment and think about how ready your business is for catastrophe. The US CDC issued Interim Guidance for Businesses and Employers this month. Every business leader should be familiar with its contents. The WHO just told countries to prepare for a pandemic. Markets are reacting.

Dow Drops 1900 Points In 2 Days As Markets Sell Off On Fears Of Coronavirus Spread
The Dow Jones Industrial Average fell 879 points. That’s on top of Monday’s drop, when the Dow tumbled more than 1,000 points.
Pandemic Preparation for IT Teams

In this post, I bundle up practical Business Continuity advice for IT teams. Often, BC is considered only as a business-wide problem. In reality, IT has some unique BC considerations. These recommendations are generally useful in preparing for any kind of disaster, from a viral pandemic to hurricanes, earthquakes, civil unrest, regional fires, or a nearby release of hazardous materials.

IT Continuity

Infected workers must stay home, even if they are showing minimal symptoms. Others must be at home to take care of an infected child, or parent. Widespread school and daycare closures could keep a large swath of parents tied to the house. Air travel could become more painful than usual. Regional quarantines are being imposed in Italy, China, and South Korea.

The best thing you can do is to enable remote work.

Just like everything else in computers, this feature must be tested. Plan an office-wide remote workday in the coming weeks.

  • Ensure each team member's home connection is stable enough for work. Note: in a pandemic situation, home internet services will likely be overwhelmed with whole neighborhoods working (or streaming) from home.
  • Confirm videoconferencing and phone bridge licenses are sufficient.
  • Validate your VPN/Citrix/remote access servers can support the entire team working at once.

Knowledge Sharing: A potential pandemic raises the chance critical staff will be out of the office or even completely unavailable. Access to services and systems is paramount: if you don't have one, deploy a shared password manager immediately. Everything in your business with a login should be accessible by at least two individuals. Caution: 1Password is my choice, but be sure that folks are storing business credentials in "shared", not "private" vaults.

Documentation and cross-training are also essential. Like the other measures in this post, they will yield compound returns on investment in the coming years. Even in the optimistic case that we see no major disasters.

Supply Chain: A direct supply chain from China is vital to manufacturers like Apple. COVID19 has already impacted Apple's financial forecasting, and we might not even see the iPhone 12 this Fall. Say you're a startup SaaS... is your supply chain important?

Take a moment to ponder if you can go 3+ months without ordering new laptops, phones, storage, or networking devices. In addition to manufacturer shortages due to closed factories, package deliveries see significant delays. Good luck on-boarding new hires!

If you have the capital to create a small hardware stockpile, do it. If not, this may be one of a few cases that debt makes sense. Lease a few extra laptops.

Dependencies: Can your team make forward progress without an internet connection? Git is supposed to make this easy, but heavy dependencies on NPM or Github could stop developers in their tracks. Standard vendoring practices for local development can keep the wheels of progress spinning.

Monitoring and alerting: Significant absenteeism guarantees your on-call rotation will blow up. An alert routing solution like PagerDuty will help tamp down the flames. By investing time now in prioritizing alerts, you can adjust alarms based on team availability.

Game Day

"Disruption to everyday life might be severe," says Nancy Messonnier, who leads the coronavirus response for the U.S. Centers for Disease Control and Prevention. "We are asking the American public to work with us to prepare for the expectation that this is going to be bad."

In an ideal world, IT teams would carve out time to perform tabletop Disaster Recovery (DR) and Business Continuity (BC) exercises quarterly. Sadly, these drills get overlooked far too often. We may have the utmost confidence in our load-balancers, our server availability, and our geographic redundancy. Engineers gravitate to technical problems – the human element gets pushed aside.

"A learning organization, disaster recovery testing, game days, and chaos engineering tools are all important components of a continuously resilient system." – Adrian Cockcroft, Failure Modes and Continuous Resilience

One software developer mantra is: "if it's not tested, it's not done." Documenting policies and processes is only the beginning. Everyone on the team must be familiar with them through hands-on experience. Google's approach to DiRT (Disaster Recovery Testing) is a spectacular model:

Using SRE and disaster recovery testing principles in production | Google Cloud Blog
See how you can use SRE and CRE principles and tests from Google, including Wheel of Misfortune and DiRT, to reduce the time needed to mitigate production incidents.
Pandemic Preparation for IT Teams

Situational Awareness

You do not want to be stuck in the office during an emergency. If an evacuation is necessary, your team wants to be with their family and community. We're all pretty aware of news as it happens, but you must be prepared to act decisively. Don't hesitate to get your team on the road before traffic makes the trip home impossible.

Generally, ensure you are receiving NOAA Severe Weather Alerts; I use the AccuWeather app for this. Subscribe to your local reverse-911 system. Specific to Coronavirus, checking the Johns Hopkins CSSE map and this update feed may give you an early heads-up for your region.

If a disaster happened, a NOAA Weather Radio and/or device with satellite weather alerts could be your only source of news. Throw one in the storage closet.

Work Environment

Person-to-person Hygiene: Individuals are a critical link in the health safety chain. Put up a poster or send a reminder email about personal health hygiene. Hopefully, basics like hand washing, not touching your face, and coughing in your elbow are already common knowledge. I'm also a big fan of switching to elbow bumps instead of handshakes (aka: "the Ebola handshake"). This could seem socially awkward to some, but doing it as a team can make it fun.

Pandemic Preparation for IT Teams
The Ebola Handshake, via: https://www.news.com.au/world/africa/un-ambassador-shows-off-new-ebola-handshake/news-story/07f792dd112f07ef2d904937a7d87e2e

Office Hygiene: Give hand sanitizer and surface wipes a prominent place in your office. Smaller personal/team fridges are better than full-sized ones. Talk through your cleaning contract with both the cleaning team and your staff. Your team needs to know precisely which housekeeping tasks are their responsibility.

Travel and Events: The CDC has issued a Level 3 travel warning for China and South Korea: Avoid all nonessential travel to these regions. Iran, Italy, and Japan are at Level 2, in which older adults and those with chronic illness should stay home. Hong Kong is at a "watch" level. Latin America's first case was just detected.

This list is likely to grow until a pandemic is declared. At that point, containment efforts such as quarantine and travel restrictions will be mostly abandoned.

Conference planners should engage health safety experts for events in the coming months. As of now, the 2020 Olympics are on for Tokyo, but IOC members are raising concerns. There is no reason to cancel events in the US at this time, but every reason to think ahead and plan for the impacts of mandated social distancing or quarantine.

Emergency Supplies: The Red Cross recommends three days of emergency supplies stored at the office for short-term sheltering in place. This includes: nine meals worth of non-perishable food and three gallons of water per person; fans or indoor-safe propane heaters, as appropriate; toilet paper, soap, and paper towels; flashlights and batteries.

Medical: Ensure your office space has a well-stocked First Aid kit. You would typically go to the hospital for an in-office paper-cutter accident, but with a potentially overloaded health system, the wait could be many hours. You need to be able to treat "minor" injuries without visiting the ER. If your team doesn't have emergency medical experience, bring the Red Cross in for a First Aid training. Every office should stock an Automated External Defibrillator.

Personal Protective Equipment (PPE): Face masks and latex gloves are the norm in healthcare, and visible all over Asia right now. I doubt that IT teams should stock them in the office. Face masks are most effective for preventing the spread of a virus. They are much less useful for avoiding getting sick. If someone has symptoms, send them home immediately. Good interpersonal and office hygiene should be a higher priority.

Environmental Protection: In 2016, a toxic gas cloud threatened almost 20 million people near Los Angeles. If your office is within ten miles of a highway, railway, or factory, you should have what it takes to protect your team from this danger. Identify an interior conference or basement room, keep some plastic wrap and duct tape around, and purchase a carbon monoxide detector.

URGENT - IMMEDIATE BROADCAST REQUESTED
SHELTER IN PLACE WARNING - 12:00 PM PST SAT DEC 20 2016

THE LOS ANGELES POLICE DEPARTMENT ARE ASKING ALL RESIDENTS OF LOS ANGELES COUNTY TO STAY INDOORS. 

AT 12:00 PACIFIC STANDARD TIME A CAR ACCIDENT HAS OCCURRED ON INTERSTATE 5 OUTSIDE LOS ANGELES COUNTY.

THIS MOTOR VEHICLE ACCIDENT INCLUDED A SEMI TRUCK CARRYING 5 TONS OF SULFURIC ACID AND TOXIC GAS. FIRE FROM ACID HAD RELEASED TOXIC GAS THAT IS IN THE AIR. THE DIRECTION THAT THE WIND PUSHED THE GAS INTO DOWNTOWN LOS ANGELES. SHELTER IN PLACE IS IN EFFECT WITHIN 5 MILES OF THIS AREA. 
 
PLEASE DO NOT GO OUTDOORS AND REMAIN AWAY FROM WINDOWS OR ANY OPEN DOORS. PLEASE STAY TUNED TO MEDIA OUTLETS FOR FURTHER INFORMATION.

Real shelter-in-place warning from LAPD, 2016.

Power: Every desk should have an Uninterruptible Power Supply (UPS). Your equipment gets protected from surges and brownouts during "normal" times. During an emergency, you can recharge your phone a hundred times with the thing.

If your building is lucky enough to have a generator, awesome. UPSs are still required to protect electronics from the power spike your generator may produce when it starts up. Check with your facilities expert, so your team knows how long you can expect your fuel supply to last. Also, understand your fuel delivery vendor's SLA: they can only deliver so much fuel each day in crisis, and you pay for your place on the list.

Personal

Planning: Remind staff to make or update their emergency plan. Red Cross recommends a two week supply of food, water (1 gal. per person per day), and essentials at home. There is a lot of stress and emotion involved in considering disastrous situations. Some people actively "prep," while others have more pressing matters or concerns. Decisions in this space are an individual choice.

Communications: Getting in touch with family and friends is vital in an emergency. Cell phone network capacity could quickly become exhausted. Landlines are more likely to work in an emergency, but do you have one at home? I have a friend in Seattle who's worried about staying connected in case of a serious earthquake. He purchased inReach Satellite Communicators for each of his family members.

Prescriptions: A pharmacy is the last place you want to visit during a pandemic. Imagine being stuck for hours in a long line of sick people. Additionally, we may see pharma supply chain problems due to factory closures in China. Encourage staff to fill and keep on-hand an extra month's worth of prescriptions.

Emergency Car Supplies: Stock your car in case you get stuck on the road overnight in a snowstorm. It happens regularly here in Colorado. On the other side of the country, hurricane evacuations have left people stuck on the highway overnight. Toss a blanket, small shovel, First Aid kit, fire extinguisher, snacks, and a gallon of water in the trunk.

Looking Forward

Pandemic Preparation for IT Teams
Preparation will power up your business. Photo by Thomas Kelley on Unsplash.

There is so much we don't know about COVID19: transmission details, percentage of cases requiring hospitalization, morbidity... if containment is even possible. With luck, COVID19 will fizzle out much like MERS. In that happy case, your efforts on these IT Business Continuity controls will not go to waste.

Just like my Y2K preparations two decades ago, investing in this work now will pay off for the next decade. Rather than worrying about COVID19, keep your focus on how powerful it is for your business to be ready for whatever is around the corner.

"Antifragility is beyond resilience or robustness. The resilient resists shocks and stays the same; the antifragile gets better." – Nassim Nicholas Taleb, Antifragile: Things That Gain from Disorder

I'm focused where software, infrastructure, and data meets security, operations, and performance. Follow or DM me on twitter at @nedmcclain.

]]>
<![CDATA[Custom Object Recognition for Non-Developers]]>From software development to economics to marketing, familiarity with machine learning will make you better at your job. The power of machine learning is in your reach today. You don't have to understand models or tensors or any math at all. Now is the time to add ML

]]>
https://www.nedmcclain.com/custom-object-recognition-for-non-developers/65caf591b81cfc00016649a1Wed, 19 Feb 2020 16:52:00 GMT

From software development to economics to marketing, familiarity with machine learning will make you better at your job. The power of machine learning is in your reach today. You don't have to understand models or tensors or any math at all. Now is the time to add ML to your professional tool belt.

"I think A.I. is probably the single biggest item in the near term that’s likely to affect humanity." - Elon Musk

Let's be honest: the ML space is a jungle. "GPUs and clusters and python, oh my!" What's the best way to use ML in your world?

I've had the pleasure of using a variety of ML tools in the past year, and have stumbled on a wonderful workflow. This workflow is easy for non-technical users and flexible enough for the most demanding data scientists. I recently used it to train a neural network to identify a custom type of object: ski lifts.

Read on and I'll share the best MLOps tools and workflow for individuals and small teams.

ML Research vs. MLOps

MLOps is applied machine learning. Just like you don't have to know how to program in C++ to use your Chrome browser, a deep understanding of machine learning internals is not necessary to apply ML to your business.

Machine learning can solve a vast diversity of business problems, including recommendation engines, natural language understanding, audio & video analysis, trend prediction, and more. Image analysis is an excellent way to get familiar with ML, without writing any code.

Transfer Learning

It is expensive and time-consuming to train a modern neural network from scratch. Historically, too expensive for individuals or small teams. Training from scratch also requires a gigantic data set with thousands of labeled images.

These days, transfer learning allows you to take a pre-trained model and "teach" it to recognize novel objects. This approach lets you "stand on the shoulders of giants" by leveraging their work. With transfer learning, you can train an existing object recognition model to identify custom objects in under an hour. Although the first papers about transfer learning came out in the 90's, it wasn't practically useful until recently.

YOLOv3: You Only Look Once

There are many solid object recognition models out there - I chose YOLOv3 for its fast performance on edge devices with minimal compute power. How does YOLOv3's neural network actually work? That is not a question you have to answer in order to apply ML to your business.

YOLO: Real-Time Object Detection
You only look once (YOLO) is a state-of-the-art, real-time object detection system.
Custom Object Recognition for Non-Developers

If you are curious, PJReddie's overview is both educational and entertaining. Also, check out the project's license and commit messages if you are looking for a good laugh.

Below, we'll take a quick look at using the Supervise.ly platform for labeling, augmentation, training, and validation. We'll also talk about using AWS Spot Instances for low-cost training.

Supervise.ly

Supervise.ly is an end-to-end platform for applying machine learning to computer vision problems. It has a powerful free tier, a beautiful UI, and is the perfect way to get familiar with machine learning. No coding required.

Supervisely - Web platform for computer vision. Annotation, training and deploy
First available ecosystem to cover all aspects of training data development. Manage, annotate, validate and experiment with your data without coding.
Custom Object Recognition for Non-Developers

Note that AWS has a comparable offering: AWS SageMaker. I've had great luck with SageMaker and can recommend it to organizations that are large enough to have a dedicated "operations" person. For smaller teams, Supervise.ly is my go-to tool.

Custom Object Recognition for Non-Developers
Sagemaker has a friendly IDE and automates many tasks, but it still sits on rather complex infrastructure. https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html

Another caution with SageMaker is that it requires familiarity with Jupyter Notebooks and Python. Jupyter Notebooks are amazing, and there's always at least one open in my browser tabs. The interactive, iterative approach is unbeatable for experimentation and research. Unfortunately, they are unaccommodating to non-developers.

Supervise.ly's web UI is a much more accessible tool for business users. It also supports Jupyter Notebooks for more technical users.

AWS Spot Instances

The one trick to Supervise.ly is you need to provide a GPU-powered server. If you have one under your desk, you're all set. If not, AWS is a cheap and easy way to get started.

AWS's p3.2xlarge instance type is plenty powerful for experimentation and will set you back $3.06 per hour. Training the YOLOv3 model to recognize chair lifts took under 15 minutes - costing way less than a latte.

That's not a bad deal, but AWS Spot Instances are even better. Spot Instances are interesting because the prices change over time, and there is a possibility AWS will reclaim your instance with only a couple of minutes' warning. These limitations have no impact on our use case. I highly recommend clicking the "Request Spot Instances" checkbox.

Custom Object Recognition for Non-Developers
p3.2xlarge spot prices in January 2020 - saving you 93%

Do It Yourself

Now is a great time to train your own custom model. I'll walk you through it in the ten-minute video below.

If you'd prefer a written tutorial, Supervise.ly has a good one here. It's also okay to just skip this section if you don't want hands-on experience.

If you follow along, you'll need to hit "pause" in a few places while we wait on the computer. You'll also need the image augmentation DTL below:

Example DTL for image augmentation with Supervise.ly
Example DTL for image augmentation with Supervise.ly - supervisely_dtl_example.json
Custom Object Recognition for Non-Developers

⚠️ Important: Be sure to terminate your spot instance when you have finished testing your model!

Next Steps

Custom Object Recognition for Non-Developers
Object tracking in production

In this post, we trained our own model to recognize custom objects. In a future post, I will share easy ways to deploy your custom model. We'll learn how to deploy it on your laptop, as a cloud API, and to an "edge" device like a Raspberry Pi.

"I think because of artificial intelligence, people will have more time enjoying being human beings." - Jack Ma

Looking for help applying machine learning to your business problems? I'm focused where software, infrastructure, and data meets security, operations, and performance. Follow or DM me on twitter at @nedmcclain.

]]>
<![CDATA[Edge Networking’s Last Resort]]>https://www.nedmcclain.com/edge-networkings-last-resort/65caf591b81cfc000166499bFri, 24 Jan 2020 15:03:00 GMT

As the number of IoT devices has exploded, so has the compute power of each individual device. This post looks at a simple way to use that power to improve the availability of edge/IoT devices.

The Problem

IoT devices tend to have unreliable network connections, by their nature of being at the “edge”. Whether it’s flakey wired connections or WiFi that drops every two hours, these devices need a robust way to recover. It’s far too common for an IoT device to get “wedged” and not come back online.

Cloud servers have management APIs that allow an administrator to reboot a hung machine. Physical servers typically have serial consoles and remotely-managed BIOS and power supplies — or at least someone who can go kick the server.

Edge devices often have no remote management options. When their network connection fails, they need to handle the issue autonomously. Adding to the challenge, the exact steps these IoT devices must follow are unique to their environment.

Edge Network Watchdog

I created edge-netdog as a last-resort solution for IoT devices that experience network problems. It is simple, flexible, and intended to run on small Linux devices like the Raspberry Pi, Jetson Nano, or Wyze Cam.

Rather than testing individual network services like WiFi, DNS, and routing, edge-netdog checks for access to a website. This “end-to-end” testing ensures that all networking components are working in harmony. If edge-netdog detects a network outage, it will perform a set of mitigation actions, one at a time.

Here’s a simple edge-netdog configuration. After two minutes of network downtime (4 checks x 30 seconds), it runs the first mitigation action. If reconfiguring the WiFi hasn’t fixed networking, edge-netdog will try refreshing DHCP and restarting the networking service. Finally, if all else fails, it will reboot the device.

---
global:
    debug: true
    monitor_interval: 30s
    target_attempts: 4
    action_delay: 30s
    target_url: https://example.com
    target_match: Example Domain
actions:
    - sudo wpa_cli -i wlan0 reconfigure
    - sudo dhclient -r
    - sudo service networking restart
    - sudo reboot

This approach is effective for IoT devices, but also heavy-handed. It’s definitely inappropriate for any device that offers remote out-of-band management. Don’t use edge-netdog on your servers!

Using the Processor’s Watchdog

Edge Networking’s Last Resort
Photo by Urban Sanden on Unsplash.

The edge-netdog tool works well for detecting network issues, but it’s worthless if the device’s operating system fails. We need a second watchdog!

Modern processors offer an internal watchdog for just this scenario. Once enabled, the processor will force a reboot if it isn’t “patted” by the OS every so often.

For example, the Raspberry Pi’s processor watchdog interacts with the Linux watchdogd service. The Linux service performs some health checks and “pats” the processor’s watchdog. If a health check fails, or if watchdogd itself stops working, the processor watchdog forces a reboot.

The watchdogd service has a bunch of configurable health checks, such as max-load-1, min-memory, max-temperature, and the ability to watch arbitrary text files. It’s even possible to configure rudimentary network health checks, but they are limited to ping’ing a hardcoded IP address. The edge-netdog tool supplements watchdogd with higher-level network checking and more customizable recovery actions.

Every edge/IoT device running Linux should have watchdogd configured. Here’s an excellent guide with more technical details.
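On a Raspberry Pi, the setup is roughly the following sketch; package names, config paths, and thresholds may differ on your distro, so treat these values as assumptions:

# Enable the SoC watchdog and install the userspace daemon:
echo 'dtparam=watchdog=on' | sudo tee -a /boot/config.txt
sudo apt-get install -y watchdog
# Point the daemon at the hardware device and reboot on extreme load or exhausted memory:
echo 'watchdog-device = /dev/watchdog' | sudo tee -a /etc/watchdog.conf
echo 'max-load-1 = 24' | sudo tee -a /etc/watchdog.conf
echo 'min-memory = 1' | sudo tee -a /etc/watchdog.conf
sudo systemctl enable --now watchdog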

General Tips for Stability

  • Power supply: Flakey power and brownout conditions are responsible for many IoT problems. Ensure your power supply meets the device’s requirements, and deploy a battery backup in locations that have unreliable power.
  • SD card/flash: SD cards are flakey and happy to corrupt themselves on power failure. Make sure to use branded, high-end SD cards for your production deployments. It’s also worth considering an industrially-hardened IoT device, like the Balena Fin. Instead of an SD card, the Fin has an on-board eMMC flash chip for storage.
Edge Networking’s Last Resort
  • USB/SATA drive: SD cards are great, but an external hard drive is essential for I/O intensive applications. An SD card will wear out after 100k write cycles. Databases, monitoring systems, and similar apps will quickly run into problems with SD wear and deserve a real disk.
  • Enclosure: Picking an appropriate enclosure for your IoT device is key. The case should support the device’s cooling needs, including room for heat sinks and possibly a fan. Weatherproof cases should be used where appropriate.
  • Hardware Watchdog: There are situations where even the processor’s watchdog can’t help. Power surges or brownouts can leave the processor unable to perform a soft reboot. A hardware watchdog provides the ultimate insurance in these cases — it can perform a hard power cycle if the device becomes hopelessly wedged. There are many good options for Raspberry Pi and Arduino-compatible devices.

Let’s talk about IoT!

If edge-netdog is working for you [or not!], I’d love to hear about it below!

Looking for help with your IoT/edge-networking? I consult on strategy, architecture, security, deployment, and fleet management — let’s chat!! Follow or DM me on twitter at @nedmcclain.

]]>
<![CDATA[Software for Fresh Powder]]>https://www.nedmcclain.com/software-for-fresh-powder/65caf591b81cfc000166499dMon, 20 Jan 2020 17:50:00 GMT

My favorite time on skis or a snowboard is when there is fresh powder. At the resort, some of the deepest snow can be found when runs open at the beginning of the season.

Unfortunately, lift and trail openings are rarely announced in advance. Ski patrol works hard to avoid avalanche danger — often judging openings on the spot. How is a powder hound to know when their favorite tree run just opened?

With lift and trail status updates to their phone, of course!

Software for Fresh Powder
Copper’s new Three Bear’s Lift is open!

Turns out, most trail status pages are powered by a simple JSON document! I wrote a quick tool named FreshPow to send Slack alerts when a lift or trail changes status — read on for details.

How It Works

When you visit a typical ski resort’s status page, your browser requests lift and trail info from the server. Javascript on the browser renders the JSON data in an attractive way for you to see. If you wanted to, you could sit at your browser and reload the status page every five minutes to check when your favorite run opens.

The FreshPow tool does the same thing — it checks the status data for changes every five minutes. When it notices a change, it fires off a Slack alert. Here’s what the raw data for a single trail looks like:

Software for Fresh Powder
Pinball is closed, but it will hold deep powder when it opens!

You can download FreshPow for Windows, Linux, macOS, and RaspberryPi from the Github page. Run it from the command line to check things out:

./freshpow --debug Vail

You’ll need to set the SLACK_WEBHOOK and SLACK_CHANNEL environment variables to enable notifications. I chose Slack, but it’d also be easy to add support for other channels like email or text messages.
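Something like this, where the webhook URL and channel are obviously placeholders:

export SLACK_WEBHOOK='https://hooks.slack.com/services/XXX/YYY/ZZZ'
export SLACK_CHANNEL='#powder-alerts'
./freshpow Copper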

As a Colorado skier, I focused on support for local resorts, but there’s no reason we couldn’t add other areas:

$ ./freshpow -l
Supported resort names: Eldora, Steamboat, Keystone, CrestedButte, Copper, Snowshoe, Blue, Breckenridge, BeaverCreek, Dev, Stratton, Tremblant, WinterPark, Vail

A Historical Perspective

While we’re at it, why not track this data over time? It’s fun to look back as trails and lifts open over the season. And by incorporating weather and snow data, you can pick out interesting trends.

Here are a few shots of my Copper Mountain dashboard:

To achieve this, I pull the same JSON data into a Prometheus server and use Grafana for visualization. It’s fine to use a phone or web browser to check your ski status dashboard, but even better to display it for everyone to see. Ours is next to the breakfast table, where we can figure out what layers, wax, and skis/boards to ride for the day.

An old iPad is probably the best option for displaying the results. I used a Raspberry Pi 4, 7-inch touch display, and SmartiCase as a slick, low-cost alternative.

Software for Fresh Powder

If you put FreshPow to work for you, I’d love to hear about it — get in touch below!

If you need other resorts or notification methods added, please open an issue! Pull requests with improvements and new features are also very welcome!!

I'm focused where software, infrastructure, and data meets security, operations, and performance. Follow or DM me on twitter at @nedmcclain.

]]>
<![CDATA[Cloud Native Tools Every Enterprise CTO/CIO Should Have on Their Radar in 2020]]>https://www.nedmcclain.com/cloud-native-tools-every-enterprise-cto-cio-should-have-on-their-radar-in-2020/65caf591b81cfc000166499eTue, 07 Jan 2020 18:05:00 GMTCloud Native Tools Every Enterprise CTO/CIO Should Have on Their Radar in 2020
Cloud Native tools landscape from https://landscape.cncf.io/
Cloud Native Tools Every Enterprise CTO/CIO Should Have on Their Radar in 2020

Some cloud native tools are only appropriate for “cloud native” startups. But look at the main goals of cloud native software, and you’ll see they are tightly aligned with enterprise digital transformation:

Operability, Observability, Elasticity, Resilience, and Agility.

This post identifies the mature cloud native tools that every enterprise should consider in 2020. The tools highlighted below are robust, proven, and hardened for enterprise use today. These are not just incremental improvements on existing IT tools; they are enabling technologies that cut costs, improve security, and foster a stronger DevOps culture.


Infrastructure as Code

If your ops team is still SSH’ing or RDP’ing into servers to make changes, Infrastructure as Code (IaC) should be your highest priority. Capturing your server, OS, and app configurations “as code” means you can track changes in a version control system like Git.

Although IaC feels a little odd to traditional sysadmins at first, it yields a ton of benefits. Everyone can see what changes were made, when, and by who — at all layers of the IT stack. It’s easy to detect configuration drift, and you can prepare for a security audit in no time.

Even though Chef and Puppet are the most popular tools, I usually recommend Ansible and Terraform to my enterprise clients. They are cross-platform (and cross-cloud) and are the most accessible IaC tools to understand, teach, and share — making them the most DevOps-friendly. Packer is worth investigating for anyone making their own VM or cloud server images.
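For a sense of what "configuration as code" looks like in practice, here is a minimal Ansible playbook. It is only a sketch: the host group and package are placeholders, and the point is that the file lives in Git like any other code.

# playbook.yml -- illustrative only; host group and package names are placeholders
- name: Baseline web tier
  hosts: webservers
  become: true
  tasks:
    - name: Ensure nginx is installed
      package:
        name: nginx
        state: present

    - name: Ensure nginx is running and enabled at boot
      service:
        name: nginx
        state: started
        enabled: true

Every change to this file goes through version control and review, which is exactly what makes drift detection and audits so much easier.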

Service Discovery

The vast majority of enterprises I’ve seen use DNS for “service discovery” — although sadly, hardcoded IP addresses are also common. Experienced IT admins can share plenty of war stories about DNS caching and propagation times. When you are targeting more than “three nines” of availability, traditional DNS servers aren’t sufficient for High Availability failover.

There are dozens of Service Discovery tools on the cloud native landscape — and many of them are not so useful in the enterprise datacenter. Still, a few gems exist that can empower enterprise IT: they offer service health checks and automated failover, work with existing enterprise applications, and expose API-based management. Deploying service discovery usually means that legacy hardware load balancers can be decommissioned.

For enterprises looking to take advantage of Service Discovery, I usually recommend one of three approaches:

  1. Start with DNS: Use a Service Discovery platform that supports DNS. Consul, etcd, and CoreDNS expose their service discovery data through both DNS and a simple API. This approach allows legacy applications to route traffic in a flexible way, without load balancers or code changes. (A quick Consul example follows this list.)
  2. Grow to a Service Proxy: Use a proxy to present your services. This architecture has most of the features of cutting-edge service discovery solutions, without a lot of new technology to understand.
    The cloud native world has brought us dozens of great options — I’ve found that enterprise IT shops get a lot of ROI out of traditional proxies like Apache and Nginx. They are familiar, quick to deploy, and plenty powerful.
    If you are looking for more modern features, Envoy, Kong, and Tyk are worth exploring.
  3. Consider Mesh Networking: If your enterprise is dealing with “hybrid cloud,” a service mesh network could be your savior. Building on pure Service Discovery and Service Proxy tools, mesh solutions provide a secure, transparent network across data centers and clouds.
    Some of the most popular tools in this space are tightly bound to Kubernetes, making them difficult to use with legacy applications. Others work well outside Kubernetes and are easily accessible to traditional IT. For enterprises that aren’t all-in on Kubernetes, I recommend Consul, Flannel, or Weave Net (shout out WeaveWorks Denver office!).
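To make option 1 concrete: Consul answers service lookups both over DNS (port 8600 by default) and over a simple HTTP API (port 8500 by default). A quick sketch, assuming a local Consul agent with a service registered as "web":

# Healthy instances of the "web" service, via Consul's DNS interface:
dig @127.0.0.1 -p 8600 web.service.consul SRV

# The same data via Consul's HTTP API:
curl 'http://127.0.0.1:8500/v1/health/service/web?passing'

Legacy applications that can only resolve hostnames get the DNS view; newer tooling can use the API for richer health and metadata.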

Zero Trust Security

The past few decades have witnessed an evolution in remote access security models. In the early days, dial-up connections and leased lines were considered “secured” by the magic of Ma Bell.

More recently, security depended on a firewalled network perimeter and remote VPN access. This model is fondly known as “Tootsie Pop” security — hard on the outside and chewy in the middle. [Hat tip to Trent R. Hein for this!]


Today, the preferred practice is “Zero Trust Security.” Equally applicable to startups and Fortune 500s, this approach securely authenticates access to individual services, regardless of where on the network the user resides. At a high level, Zero Trust security harkens back to the failed attempts at enterprise “Single Sign On” Portals that Oracle and CA hawked in the ’90s.

Google’s https://beyondcorp.com/

Zero Trust is more secure than a VPN model — it is also much more user-friendly. Google helped raise the Zero Trust security model to prominence with its BeyondCorp initiative.

Google’s Cloud Identity and Access Management service is an easy way to get started with the BeyondCorp model. Alternatively, my go-to open source tool in this space is oauth2_proxy, which “just works.”
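oauth2_proxy sits in front of an internal web app and requires an OAuth login before passing requests through. A minimal sketch, assuming Google as the identity provider and a hypothetical internal app on port 8080; the client ID/secret and email domain are placeholders:

./oauth2_proxy \
  --provider=google \
  --email-domain=example.com \
  --upstream=http://127.0.0.1:8080 \
  --http-address=0.0.0.0:4180 \
  --client-id=CHANGE_ME \
  --client-secret=CHANGE_ME \
  --cookie-secret="$(openssl rand -base64 32)"

Point your load balancer or reverse proxy at port 4180, and only authenticated users from your domain ever reach the app.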

For SSH-centric shops, Gravitational’s Teleport is spectacular. Sharing and recording SSH sessions benefits everything from knowledge-transfer to incident response to compliance.

Citrix remains the go-to solution for Windows-centric shops where a majority of applications are NOT web-based.

Observability & Monitoring

There’s an ongoing debate in the cloud native community about what “observability” even means. Maybe it’s a three-legged stool, or maybe it’s a high-cardinality event analysis tool? Without going down that rat-hole, I think there are a few opportunities for enterprises to leverage progress in this space:

  1. Monitoring and metrics collection has come a long way since Nagios was first incarnated in 1999. Cloud native monitoring platforms offer excellent performance, robust alerting/notification systems, and beautiful dashboards, and they work well with dynamic resources like cloud servers and VMs. Prometheus is the de-facto cloud native monitoring tool, with Grafana for visualizations. Prometheus’ support for exposing status to legacy tools like Nagios means that most enterprises find the transition painless and rewarding. (A minimal example alerting rule follows this list.)
  2. Centralized event logging is a well-established IT service in most enterprises. The next steps are instrumenting applications to produce structured events and debugging traces. With this data, a high-cardinality analysis tool will give your team deep insight into the experiences of individual users. Unfortunately, organizations that are not developing software in-house will not be able to take advantage of the latest advances in event analysis and application tracing. The key opportunity here is to push vendors to produce detailed, structured events across the IT stack.
  3. Many IT organizations are severely deficient in event/metric retention. When I was a security incident investigator, my heart always sank when we found the hacked organization only had the last week’s worth of logs. Sadly, the truth is that most security incidents take months, not weeks to discover. Your organization should be able to look back through application and infrastructure event logs for at least the past year.
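As a small example of the alerting that comes with a cloud native monitoring stack (point 1 above), here is a standard Prometheus rule file that pages when any scrape target stops responding. The threshold and labels are illustrative:

groups:
  - name: availability
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 5 minutes"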

Object Storage

Databases have been around for more than 50 years. While we’re historically familiar with Relational Databases (1970), the cloud native ecosystem has brought us completely new types of databases: NoSQL, NewSQL, Document, Time-Series, and Column-Store, to name a few.

Object Storage is a type of cloud native database that enterprises can benefit from now. Unlike a Relational Database, Object Storage systems are optimized for storing files. SQL isn’t supported — you store and fetch files based on a URL. Simplicity is its strength: Object Storage is supported by most modern analytics, security, and data science tools.

Although AWS has had S3 object storage since 2006, Google and Azure recently built very competitive offerings. All of these hosted Object Storage services are protocol-compatible, so they are somewhat of a commodity [data gravity aside]. Likely, some department in your enterprise is already using S3 today (marketing, data science, IT, etc.).

DANGER: Many organizations have exposed sensitive information via misconfigured AWS S3 and GCP GCS buckets recently. You MUST have a solid “Cloud Object Storage” security policy and automated monitoring in place before storing non-public information in the cloud.
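On AWS, for example, one concrete guardrail is to block public access on every bucket; the bucket name below is a placeholder:

aws s3api put-public-access-block \
  --bucket my-company-data \
  --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true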

Many enterprises will find performance and security improvements by hosting Object Storage on-prem. For those that want a self-hosted solution, I recommend MinIO. While there are several commercial, self-hosted, S3-compatible products [1][2][3], in this case, the open source choice is best:

MinIO | High Performance, Kubernetes-Friendly Object Storage
MinIO’s High Performance Object Storage is Open Source, Amazon S3 compatible, Kubernetes Friendly and is designed for cloud native workloads like AI.
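Kicking the tires on MinIO takes just a couple of commands. This is only a local sketch, not a production deployment; the credentials and bucket name are placeholders:

# Run a single-node MinIO server locally:
docker run -d -p 9000:9000 \
  -e MINIO_ACCESS_KEY=CHANGE_ME \
  -e MINIO_SECRET_KEY=CHANGE_ME_TOO \
  minio/minio server /data

# Any S3-compatible client works, including the AWS CLI pointed at the local endpoint:
AWS_ACCESS_KEY_ID=CHANGE_ME AWS_SECRET_ACCESS_KEY=CHANGE_ME_TOO \
  aws --endpoint-url http://localhost:9000 s3 mb s3://scratch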

Tread With Caution


I’m convinced that the tools above hold tremendous opportunity for every enterprise. They provide significant gains, and a pilot can be run quickly without staff training or forklift technology replacement.

Despite the risk of Twitter backlash, there are two popular cloud native tools that I would recommend enterprises avoid rushing to adopt in 2020: Kubernetes and Serverless.

Kubernetes and containerization are very powerful abstractions. They are a natural evolution in the datacenter once Infrastructure as Code is in place. However, these tools require tight collaboration between Dev and Ops teams and are a rough fit for Enterprises with legacy software.

Serverless promises to liberate IT from server provisioning, patching, and maintenance. Unfortunately, it generally requires re-writing applications from the ground up. While Serverless may be worth investigating for green-field projects, it’s a bad fit for in-production Enterprise applications.

Seize This Opportunity!!

Yes — these tools are awesome technical enablers that will yield real operational and capital savings! But that’s not even the best thing about them:

Cloud native tooling can help to foster powerful collaboration between Dev and Ops teams. These tools all provide better visibility across the organization, create agility for rapid response, and improve safety by lowering the risk of changes.

Consider these cloud native tools in your 2020 plans!

Evi Nemeth used to say that open source software is “free, like a box of puppies.” Even with no licensing cost, you’re going to be taking care of the thing and picking up after it. As you explore open source in your organization, treat it like “real” software: allocate the resources you would to any commercial software deployment project.


Looking for help evaluating cloud native tools in your enterprise? I'm focused where software, infrastructure, and data meet security, operations, and performance. Follow or DM me on Twitter at @nedmcclain.

]]>
<![CDATA[Frontend Monitoring with Prometheus]]>https://www.nedmcclain.com/frontend-monitoring-with-prometheus/65caf591b81cfc000166499fThu, 19 Dec 2019 19:47:00 GMTSoftware testing serves to “pin” features in place — providing assurance they are not broken by future changes.Frontend Monitoring with Prometheus

Our industry has reached a consensus that testing is essential for software quality. This post explores an underappreciated aspect of software testing, and how it can be leveraged beyond the development environment.

Frontend testing offers a view of software health that is closest to the end user’s experience. If a frontend test fails, you can be sure something is broken in your application. But a failed frontend test doesn’t provide much insight into what is broken.

As Nathan Peck points out in his post on “the testing pyramid”, frontend tests (AKA: UI Tests) are just the tip of the pyramid. Lower-layer tests provide more visibility into the causes of failures, and most importantly, they are not as “brittle”. Frontend tests are considered brittle because they must be updated most frequently and because they view the system as a whole rather than as decomposed parts.

Credit: https://medium.com/@nathankpeck/microservice-testing-introduction-347d2f74095e

Frontend: The Swiss Army Knife of Testing

I agree that frontend tests are wholly insufficient for ensuring code quality. I get why they are generally disliked throughout the software development lifecycle. As a developer, I know how frustrating it is to update dozens of tests for each change. And as a long-time Ops engineer, I avoided frontend tests like the plague.

However, I think the true value of frontend tests has gone unrecognized. Sure, they are useful from development through the CI/CD pipeline, but they are also incredibly useful in production! There is a huge missed opportunity in not re-using frontend tests for production observability: you gain deeper insight into your production application's health, an additional return on an investment you've already made.

Sadly, all too often testing is considered the domain of software engineers. After all, operations engineers are supposed to focus on observability, not testing… right? Wrong!

In this post, our goal is to leverage frontend browser testing for production observability and monitoring. We’ll use nightwatchjs_exporter to capture the results of Nightwatch.js tests with the Prometheus monitoring tool.

Production Monitoring of Nightwatch.js Test Results

Automating Frontend Testing with Nightwatch.js

Nightwatch.js provides a DevOps-friendly frontend testing framework — tests are easy to write, and everyone on the team can read them without training. Developers write tests as code is created, quality engineers use and improve them, and even product managers will be comfortable capturing key user stories as formal tests.

Example test definition from nightwatchjs.org.

Nightwatch.js tests run in a real headless Chromium browser, with the same DOM and Javascript engine used by Chrome — Firefox, Microsoft Edge, and Safari drivers are also supported. Developers can test on their laptops using Docker containers, and CI/CD testing can employ a variety of operating systems and browser versions.

Example test output from nightwatchjs.org.

After all this investment in mature frontend tests, why are Ops teams failing to use them in production?


Production Frontend Testing with Prometheus.io

Prometheus.io has become the de-facto cloud-native monitoring tool, but it’s also powerful in SMB and enterprise environments. Can you find Prometheus’ white-on-red torch in the “Where’s Waldo” of the Cloud Native Computing Foundation’s landscape?

Prometheus.io is one of a handful of official CNCF “Graduated Projects.”

For data collection, Prometheus depends on exporters that provide correctly-formatted metrics. Good examples are the node_exporter for system data, the MySQL stats exporter, and the HAProxy exporter.

To tie Nightwatch.js and Prometheus together for use in production, I wrote nightwatchjs_exporter. It does two things:

  1. Runs Nightwatch.js tests on a schedule.
  2. Exports test results in Prometheus format.

https://github.com/nmcclain/nightwatchjs_exporter


Using nightwatchjs_exporter

Feel free to skip this section if you’re not interested in technical details!

Gory implementation details are provided in the README, but this overview will give you a feel for how things fit together:

  • nightwatchjs_exporter executes a set of Nightwatch.js tests on a regular schedule, storing the test results.
nightwatchjs_exporter is configured with a few command-line flags.
  • Prometheus scrapes the test results from nightwatchjs_exporter on a regular schedule, again storing the test results for analysis and alerting.
Prometheus “scrape” configuration pointing at nightwatchjs_exporter.
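The exporter's listen address depends on how you start it, so treat the target below as a placeholder. A minimal prometheus.yml scrape job might look like this:

scrape_configs:
  - job_name: 'nightwatchjs'
    scrape_interval: 60s
    static_configs:
      - targets: ['nightwatch-host:9155']   # placeholder: your nightwatchjs_exporter address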

This decoupled approach follows the standard Prometheus exporter architecture. It ensures that the monitoring server’s performance isn’t impacted by frontend test execution.


Not a Replacement for Holistic Observability

The technique of applying frontend tests in production can be a super-power for startups and B2B applications — but it’s not a replacement for truly capturing each user’s actual experience.

Whether you call it Real User Monitoring (RUM), high-cardinality analysis, or traditionally, centralized event logging, this part of observability is perhaps the most valuable. I’m an advocate of retaining these events for at least 15 months so you can compare user experiences against “this time last quarter.”

At massive scale, using frontend tests for monitoring starts to lose value. With hundreds of thousands of daily users, canary tests and high-cardinality observability tools can offer deeper insight than independent browser tests.


I encourage you to check out nightwatchjs_exporter and start leveraging frontend tests in your production environment!

Found a bug in nightwatchjs_exporter? Want to contribute? Open an issue!

Looking for help with your observability strategy? I'm focused where software, infrastructure, and data meet security, operations, and performance. Follow or DM me on Twitter at @nedmcclain.

]]>