Cloud World

  • Subscribe to our RSS feed.
  • Twitter
  • StumbleUpon
  • Reddit
  • Facebook
  • Digg

Tuesday, 11 June 2013

Google BigQuery gets bigger, faster, and smarter with big result sets and new analytics functions

Posted on 08:00 by Unknown
Google BigQuery is designed to make it easy to analyze large amounts of data quickly. Today we announced several updates that give BigQuery the ability to handle arbitrarily large result sets, use window functions for advanced analytics, and cache query results. You are also getting new UI features, larger interactive quotas, and a new convenient tiered pricing scheme. In this post we'll dig further into the technical details of these new features.



Large results

BigQuery is able to process terabytes of data, but until today BigQuery could only output up to 128 MB of compressed data per query. Many of you asked for more and from now on BigQuery will be able to output results as large as the largest tables our customers have ever had.



To get this benefit, you should enable the new "--allow_large_results" flag when issuing a query job, and specify a destination table. All results will be saved to the new specified table (or appended, if the table exists). In the updated web UI these options can be found under the new "Enable Options" menu.



With this feature, you can run big transformations on your tables, plus get big subsets of data to further analyze from the new table.



Window functions

BigQuery's power is in the ability to interactively run aggregate queries over terabytes of data, but sometimes counts and averages are not enough. That's why BigQuery also lets you calculate quantiles, variance and standard deviation, as well as other advanced functions.



To make BigQuery even more powerful, today we are adding support for window functions (also known as "analytical functions") for ranking, percentiles, and relative row navigation. These new functions give you different ways to rank results, explore distributions and percentiles, and traverse results without the need for a self join.



To introduce these functions with an advanced example, let's use the dataset we collected from the Data Sensing Lab at Google I/O. With the percentile_cont() function it's easy to get the median temperature over each room:

SELECT percentile_cont(0.5) OVER (PARTITION BY room ORDER BY data) AS median, room
FROM [io_sensor_data.moscone_io13]
WHERE sensortype='temperature'
In this example, each original data row shows the median temperature for each room. To visualize it better, it's a good idea to group all results by room with an outer query:

SELECT MAX(median) AS median, room FROM (
SELECT percentile_cont(0.5) OVER (PARTITION BY room ORDER BY data) AS median, room
FROM [io_sensor_data.moscone_io13]
WHERE sensortype='temperature'
)
GROUP BY room
We can add an additional outer query, to rank the rooms according to which one had the coldest median temperature. We'll use one of the new ranking window functions, dense_rank():

SELECT DENSE_RANK() OVER (ORDER BY median) rank, median, room FROM (
SELECT MAX(median) AS median, room FROM (
SELECT percentile_cont(0.5) OVER (PARTITION BY room ORDER BY data) AS median, room
FROM [io_sensor_data.moscone_io13]
WHERE sensortype='temperature'
)
GROUP BY room
)
We've updated the documentation with descriptions and examples for each of the new window functions. Note that they require the OVER() clause, with an optional PARTITION BY and sometimes required ORDER BY arguments. ORDER BY tells the window function what criteria to use to rank items, while PARTITION BY allows you to define multiple groups to be analyzed independently of each other.



The window functions don't work with the big GROUP EACH BY and JOIN EACH BY operators, but they do work with the traditional GROUP BY and JOIN BY. As a reminder, we announced GROUP EACH BY and JOIN EACH BY last March, to allow large join and group operations.



Query caching

BigQuery now remembers values that you've previously computed, saving you time and the cost of recalculating the query. To maintain privacy, queries are cached on a per-user basis. Cached results are only returned for tables that haven't changed since the last query, or for queries that are not dependent on non-deterministic parameters (such as the current time). Reading cached results is free, but each query still counts against the max number of queries per day quota.

Query results are kept cached for 24 hours, on a best effort basis.

You can disable query caching with the new flag --use_cache in bq, or "useQueryCache" in the API. This feature is also accessible with the new query options on the BigQuery Web UI.



BigQuery Web UI: Query validator, cost estimator, and abandonment

The BigQuery UI gets even better: You'll get instant information while writing a query if its syntax is valid. If the syntax is not valid, you'll know where the error is. If the syntax is valid, the UI will inform you how much the query would cost to run. This feature is also available with the bq tool and API, using the --dry_run flag.

An additional improvement: When running queries on the UI, previously you had to wait until its completion before starting another one. Now you have the option to abandon it, to start working on the next iteration of the query without waiting for the abandoned one.



Pricing updates

Starting in July, BigQuery pricing becomes more affordable for everyone: Data storage costs are going from $0.12/GB/month to $0.08/GB/month. And if you are a high-volume user, you'll soon be able to opt-in for tiered query pricing, for even better value.



Bigger quota

To support larger workloads we're doubling interactive query quotas for all users, from 200GB + 1 concurrent query, to 400 GB of concurrent queries + 2 additional queries of unlimited size.

These updates make BigQuery a faster, smarter, and even more affordable solution for ad hoc analysis of extremely large datasets. We expect they'll help to scale your projects, and we hope you'll share your use cases with us on Google+.



The BigQuery UI features a collection of public datasets for you to use when trying out these new features. To get started, visit our sign-up page and Quick Start guide. You should take a look at our API docs, and ask questions about BigQuery development on Stack Overflow. Finally, don't forget to give us feedback and join the discussion on our Cloud Platform Google+ page.



- Posted by Felipe Hoffa, Developer Programs Engineer
Email ThisBlogThis!Share to XShare to Facebook
Posted in | No comments
Newer Post Older Post Home

0 comments:

Post a Comment

Subscribe to: Post Comments (Atom)

Popular Posts

  • Bridging Mobile Backend as a Service to Enterprise Systems with Google App Engine and Kinvey
    The following post was contributed by Ivan Stoyanov , VP of Engineering for Kinvey, a mobile Backend as a Service provider and Google Cloud ...
  • Tutorial: Adding a cloud backend to your application with Android Studio
    Android Studio lets you easily add a cloud backend to your application, right from your IDE. A backend allows you to implement functionality...
  • 2013 Year in review: topping 100,000 requests-per-second
    2013 was a busy year for Google Cloud Platform. Watch this space: each day, a different Googler who works on Cloud Platform will be sharing ...
  • Easy Performance Profiling with Appstats
    Since App Engine debuted 2 years ago, we’ve written extensively about best practices for writing scalable apps on App Engine. We make writ...
  • TweetDeck and Google App Engine: A Match Made in the Cloud
    I'm Reza and work in London, UK for a startup called TweetDeck . Our vision is to develop the best tools to manage and filter real time ...
  • Scaling with the Kindle Fire
    Today’s blog post comes to us from Greg Bayer of Pulse , a popular news reading application for iPhone, iPad and Android devices. Pulse has ...
  • Who's at Google I/O: Mojo Helpdesk
    This post is part of Who's at Google I/O , a series of guest blog posts written by developers who are appearing in the Developer Sandbox...
  • A Day in the Cloud, new articles on scaling, and fresh open source projects for App Engine
    The latest release of Python SDK 1.2.3, which introduced the Task Queue API and integrated support for Django 1.0, may have received a lot ...
  • SendGrid gives App Engine developers a simple way of sending transactional email
    Today’s guest post is from Adam DuVander, Developer Communications Director at SendGrid. SendGrid is a cloud-based email service that deliv...
  • Qubole helps you run Hadoop on Google Compute Engine
    This guest post comes form Praveen Seluka, Software Engineer at Qubole, a leading provider of Hadoop-as-a-service.  Qubole is a leading pr...

Categories

  • 1.1.2
  • agile
  • android
  • Announcements
  • api
  • app engine
  • appengine
  • batch
  • bicycle
  • bigquery
  • canoe
  • casestudy
  • cloud
  • Cloud Datastore
  • cloud endpoints
  • cloud sql
  • cloud storage
  • cloud-storage
  • community
  • Compute Engine
  • conferences
  • customer
  • datastore
  • delete
  • developer days
  • developer-insights
  • devfests
  • django
  • email
  • entity group
  • events
  • getting started
  • google
  • googlenew
  • gps
  • green
  • Guest Blog
  • hadoop
  • html5
  • index
  • io2010
  • IO2013
  • java
  • kaazing
  • location
  • mapreduce
  • norex
  • open source
  • partner
  • payment
  • paypal
  • pipeline
  • put
  • python
  • rental
  • research project
  • solutions
  • support
  • sustainability
  • taskqueue
  • technical
  • toolkit
  • twilio
  • video
  • websockets
  • workflows

Blog Archive

  • ▼  2013 (143)
    • ►  December (33)
    • ►  November (15)
    • ►  October (17)
    • ►  September (13)
    • ►  August (4)
    • ►  July (15)
    • ▼  June (12)
      • Google Cloud Platform Powers Minyanville’s Buzz an...
      • Tutorial: Adding a cloud backend to your applicati...
      • Get Coding Faster Thanks to Little Green Buttons
      • Enabling Google App Engine to run in the Private C...
      • Enabling offline disk import for Google Cloud Storage
      • Google App Engine 1.8.1 Released
      • Cube Slam meets Google Cloud Platform
      • Google BigQuery gets bigger, faster, and smarter w...
      • Building Google Apps Extensions running on Google ...
      • Bridging Mobile Backend as a Service to Enterprise...
      • Cloud SQL API: YOU get a database! And YOU get a d...
      • Get your mobile application backed by the cloud wi...
    • ►  May (15)
    • ►  April (4)
    • ►  March (4)
    • ►  February (9)
    • ►  January (2)
  • ►  2012 (43)
    • ►  December (2)
    • ►  November (2)
    • ►  October (8)
    • ►  September (2)
    • ►  August (3)
    • ►  July (4)
    • ►  June (2)
    • ►  May (3)
    • ►  April (4)
    • ►  March (5)
    • ►  February (3)
    • ►  January (5)
  • ►  2011 (46)
    • ►  December (3)
    • ►  November (4)
    • ►  October (4)
    • ►  September (5)
    • ►  August (3)
    • ►  July (4)
    • ►  June (3)
    • ►  May (8)
    • ►  April (2)
    • ►  March (5)
    • ►  February (3)
    • ►  January (2)
  • ►  2010 (38)
    • ►  December (2)
    • ►  October (2)
    • ►  September (1)
    • ►  August (5)
    • ►  July (5)
    • ►  June (6)
    • ►  May (3)
    • ►  April (5)
    • ►  March (5)
    • ►  February (2)
    • ►  January (2)
  • ►  2009 (47)
    • ►  December (4)
    • ►  November (3)
    • ►  October (6)
    • ►  September (5)
    • ►  August (3)
    • ►  July (3)
    • ►  June (4)
    • ►  May (3)
    • ►  April (5)
    • ►  March (3)
    • ►  February (7)
    • ►  January (1)
  • ►  2008 (46)
    • ►  December (4)
    • ►  November (3)
    • ►  October (10)
    • ►  September (5)
    • ►  August (6)
    • ►  July (4)
    • ►  June (2)
    • ►  May (5)
    • ►  April (7)
Powered by Blogger.

About Me

Unknown
View my complete profile