One of Bedatadriven’s core projects is ActivityInfo, a database platform for humanitarian relief operations and development assistance.
Affected populations plotted by size and type on a base map of Health Zones in Eastern DRC
Originally developed for UNICEF’s emergency program in eastern Congo, today the system is used by over 75 organizations working in Africa and Asia, tracking relief and development activities across more than 10,000 project sites. With ActivityInfo, project managers can quickly establish an online database that reports the results of educational projects, maps activities that improve water and hygiene, tracks the delivery of equipment to clinics, or covers any other humanitarian activity a project undertakes.
Field offices can collect key data about a relief operation’s activities, either by entering results through an offline-capable web interface or by pushing them through a RESTful API. These results are then available to managers at a project or programme level and to the donor organisations that fund the operations and assistance.
Using ActivityInfo:
- Reduces the time spent on reporting and collecting data, leaving more for delivering practical aid and support to vulnerable people and communities
- Builds a unified view of a humanitarian programme’s progress, across partners, regions and countries
- Improves program quality, with faster and more accurate feedback into the project cycle
Choosing our Architecture
Although the code for ActivityInfo is open sourced, our vision is to offer the system as a central service to the UN, NGOs and others at ActivityInfo.org, allowing them to focus on delivering humanitarian programmes to some of the world’s most vulnerable populations. In choosing our infrastructure for ActivityInfo.org, we had several criteria:
- Given the challenging environments that ActivityInfo users work in and the nature of the crises, we needed a platform that could ensure that the system was highly available.
- Minimal system administration, allowing Bedatadriven’s focus to remain on product development - delivering the tools and functions users need to manage successful relief operations.
- A platform that could scale up and down according to the load, with minimal human intervention. The platform had to scale automatically, since load can increase by an order of magnitude or more during a peak in a humanitarian crisis.
- Clear monitoring tools to help pinpoint performance problems. Physics imposes a minimum latency of nearly 900 milliseconds per request for satellite connections, so it’s essential for us to keep the server response time as low as possible to ensure a responsive experience for users.
As our user base grew, we moved first from a single machine to another Java PaaS meant to provide dynamic scaling. Unfortunately, we found we were still spending far too much time on server administration, fussing with auto scaling triggers and responding to alerts when the platform failed to scale up the number of application servers sufficiently. Our goal of minimal system administration had been overtaken by the need to keep the system up and running.
Even worse, we lacked decent monitoring tools to identify and resolve the performance problems. There are some great open source tools out there like StatsD and Graphite, but the investment to get them up and running was more than we wanted to spend.
We had used Google App Engine for other projects and were impressed by its simplicity and stability. When the MySQL-based Google Cloud SQL service became available, we were quick to make the move.
App Engine has proved to be available and stable. Instances scale up and down with the load appropriately, without having to monkey with configuration or specify triggers through trial and error. New instances come online to serve requests in under 30 seconds, keeping request latency low even when we experience very sudden spikes in utilization.
More importantly, the strong monitoring tools have helped us quickly find and eliminate performance bottlenecks. App Engine collects logs from all running instances in near real time and has a clean interface that allows you to review and search logs, aggregated by request. This allows us to flag all requests that exceed a certain latency and drill down to the causes very quickly.
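To make the idea of flagging slow requests concrete, here is a minimal sketch in Java. It is not App Engine's log API or ActivityInfo's code: the `RequestLog` class, the `flagSlow` method, the request paths and the 1000 ms threshold are all hypothetical names and values chosen for illustration. It only shows the filtering step, assuming the per-request latencies have already been collected.

```java
import java.util.ArrayList;
import java.util.List;

public class SlowRequestFilter {

    /** Hypothetical per-request summary; not App Engine's actual log format. */
    public static final class RequestLog {
        public final String path;
        public final long latencyMillis;

        public RequestLog(String path, long latencyMillis) {
            this.path = path;
            this.latencyMillis = latencyMillis;
        }
    }

    /** Returns the requests whose latency is strictly above the threshold, in input order. */
    public static List<RequestLog> flagSlow(List<RequestLog> logs, long thresholdMillis) {
        List<RequestLog> slow = new ArrayList<>();
        for (RequestLog log : logs) {
            if (log.latencyMillis > thresholdMillis) {
                slow.add(log);
            }
        }
        return slow;
    }

    public static void main(String[] args) {
        // Made-up sample data: paths and latencies are illustrative only.
        List<RequestLog> logs = List.of(
            new RequestLog("/report/sites", 180),
            new RequestLog("/report/pivot", 4200),
            new RequestLog("/map/tiles", 950));

        for (RequestLog log : flagSlow(logs, 1000)) {
            System.out.println(log.path + " took " + log.latencyMillis + " ms");
        }
    }
}
```

With the sample data above, only the request at `/report/pivot` is above the threshold, so it is the one a developer would drill into first.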
The App Engine metrics enabled us to pinpoint the MySQL queries that needed tuning, so they no longer tied up threads on the application servers. With a minimal investment of time, we now have ActivityInfo running better than ever before.
App Engine does impose some limitations in exchange for this reliability. Some of these, like the restrictions on the Java imaging libraries, we’ve been able to work around by using pure-Java libraries to render the images and PDF exports for users (see https://github.com/bedatadriven/appengine-export).
Others, like the 30-second request limit, have made us true believers. One of our problems turned out to be a few MySQL queries that worked fine in development, but degraded under load, requiring several minutes to complete. When we got hit with a few hundred of these queries concurrently, they quickly tied up all available threads on the application servers and maxed out the connection limits on MySQL, requiring manual intervention to avoid downtime. On App Engine, these cancerous requests were shut down after thirty seconds and flagged in the logs, allowing other requests to complete normally and giving us time to optimize the queries.
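The value of a hard deadline can be illustrated with a small Java sketch. This is not how App Engine enforces its request limit, and it is not ActivityInfo's code: `DeadlineRunner`, `runWithDeadline` and the fallback value are hypothetical names invented here. The sketch uses only the standard `java.util.concurrent` classes to show the general technique of bounding how long a task may hold a thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DeadlineRunner {

    /**
     * Runs the task on a worker thread and returns its result, or returns
     * the fallback if the task has not finished within deadlineMillis.
     */
    public static <T> T runWithDeadline(Callable<T> task, long deadlineMillis, T fallback)
            throws InterruptedException, ExecutionException {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        Future<T> future = worker.submit(task);
        try {
            return future.get(deadlineMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The task is overdue: interrupt it instead of letting it hold the thread.
            future.cancel(true);
            return fallback;
        } finally {
            // Release the worker thread whether the task finished or not.
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A task that finishes well inside its deadline.
        String fast = runWithDeadline(() -> "rows", 500, "cancelled");

        // A stand-in for a query that degrades under load (sleep simulates the delay).
        String slow = runWithDeadline(() -> {
            Thread.sleep(5_000);
            return "rows";
        }, 100, "cancelled");

        System.out.println(fast + " / " + slow);
    }
}
```

The design point is the same one the platform limit makes for us: an overdue task is cut off and reported rather than left to occupy a thread indefinitely, so other work can proceed while the slow path is fixed.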
Our move to Google App Engine has proven to be a successful one, improving the quality of service to our users and allowing us to focus on software development.
-Contributed by Alexander Betram, Partner, Bedatadriven