Editor's note: Join Felipe Hoffa and Michael Manoochehri tomorrow, September 6th at 11:00AM PST on Google Developers Live where they'll discuss correlation with BigQuery.
Google BigQuery is designed to make it easy to analyze large amounts of data quickly. We are always looking into how to make BigQuery even more powerful, so today we'll introduce a feature that we couldn't wait to share with you: Pearson correlation.
BigQuery is transforming the ways in which we work with massive amounts of data. Our partners have created amazing tools to make that process even more streamlined: Visualizing, slicing, and dicing. Working with your intuition and these tools, you can discover surprising new insights, analyzing terabytes of data in mere seconds. What's still a challenge is feeding this intuition, discovering where to look for insights.
The new CORR() function in BigQuery is a powerful tool for your intuition process: Which variables are similar, or have surprising behaviors? Can you rank these surprising behaviors? What are the best variables to predict the future?
Let's look at the data we collected at the Data Sensing Lab at the Moscone Center for Google I/O. We had multiple sensors in multiple rooms, collecting temperature, humidity, noise and other data during these 3 days.
To start, we can look at what rooms' temperature behaved in a similar way during the 2nd and the 3rd day:
SELECT CORR(a.data, b.data) corr, a.room room, count(*) c
FROM (
SELECT
TIME(USEC_TO_TIMESTAMP(INTEGER(Timestamp / 60000000) * 60000000)) time, AVG(DATA) data, room
FROM [io_sensor_data.moscone_io13]
WHERE
DATE(USEC_TO_TIMESTAMP(Timestamp- 8*60*60000000)) = '2013-05-16'
AND sensortype='temperature'
GROUP EACH BY time, room) a
JOIN EACH (
SELECT
TIME(USEC_TO_TIMESTAMP(INTEGER(Timestamp / 60000000) * 60000000)) time, AVG(data) data, room
FROM [io_sensor_data.moscone_io13]
WHERE
DATE(USEC_TO_TIMESTAMP(Timestamp- 8*60*60000000)) = '2013-05-17'
AND sensortype='temperature'
GROUP EACH BY time, room) b
ON a.time=b.time AND a.room = b.room
GROUP EACH BY room
HAVING
corr IS NOT NULL
AND c > 800
ORDER EACH BY corr DESC
corr | room | c |
---|---|---|
0.9387693711 | sf desk 1st floor | 1331 |
0.8488553811 | chrome east | 1418 |
0.8423597116 | chrome hobbit | 1372 |
0.8162574011 | chrome west | 1401 |
0.7696065852 | chrome north | 1374 |
... | ... | ... |
-0.1048712561 | Room 1 | 1390 |
-0.1508345595 | keynote crowd | 1358 |
-0.5467798237 | android east | 1402 |
This table says that many rooms behaved in a similar way during both days. For example, a room inside the main exhibition area:
We see that the temperature rose during the day, in a very similar way. A/C helps a lot with this. Meanwhile in the main reception area the temperature dropped considerably when the doors opened:
Even though the base temperature was considerably higher on Day 2, CORR() finds out that the behavior was very similar.
The room 'android east' shows a highly negative correlation. We can visualize why:
Turns out something was wrong with the sensors there a few hours during Day 2, and went to 0 during Day 3. It's too late now to go and fix them, but we can keep this as a good example of how CORR() can help us set alarms and notify on unusual events.
We can repeat this experiment with other dimensions, like humidity or noise. But we can go further, and find out what dimensions correlate with each other.
SELECT CORR(a.data, b.data) corr, a.sensortype a_sensortype, b.sensortype b_sensortype, a.room room, count(*) c
FROM (
SELECT
TIME(USEC_TO_TIMESTAMP(INTEGER(Timestamp / 60000000) * 60000000)) time,
AVG(DATA) data, room, sensortype
FROM [io_sensor_data.moscone_io13]
WHERE
DATE(USEC_TO_TIMESTAMP(Timestamp- 8*60*60000000)) = '2013-05-16'
AND sensortype != 'pressure' AND sensortype != 'altitude' AND room != 'None'
GROUP EACH BY time, room, sensortype) a
JOIN EACH (
SELECT
TIME(USEC_TO_TIMESTAMP(INTEGER(Timestamp / 60000000) * 60000000)) time,
AVG(data) data, room, sensortype
FROM [io_sensor_data.moscone_io13]
WHERE
DATE(USEC_TO_TIMESTAMP(Timestamp- 8*60*60000000)) = '2013-05-16'
AND sensortype != 'pressure' AND sensortype != 'altitude' AND room != 'None'
GROUP EACH BY time, room, sensortype) b
ON a.time=b.time AND a.room = b.room
WHERE a.sensortype > b.sensortype
GROUP EACH BY room, a_sensortype, b_sensortype
HAVING
corr IS NOT NULL
AND c > 800
ORDER EACH BY corr DESC
corr | a_sensortype | b_sensortype | room | c |
---|---|---|---|---|
0.9452553258 | temperature | humidity | android east | 1436 |
0.9230651809 | temperature | gas | sf desk 1st floor | 1425 |
... | ... | ... | ... | ... |
0.7949466754 | temperature | mic | maps | 1434 |
... | ... | ... | ... | ... |
0.0013869924 | temperature | light | Room 12 | 1436 |
... | ... | ... | ... | ... |
-0.9396195218 | temperature | humidity | accesability | 1341 |
-0.9564087934 | temperature | humidity | chrome lounge | 1135 |
-0.9925864599 | temperature | humidity | sf desk 1st floor | 1437 |
We can visualize how the air quality in the reception improved, once the doors were opened - as the drop in temperature shows:
Or how the temperature rose at the same time as the volume levels, in the Maps area:
Temperature and light with almost no correlation in Room 12 calls our attention, as we got some very bad measurements in that room:
Temperatures of over 2000? Something went wrong.
At the end of the table, it's a surprise that while humidity had a high negative correlation to temperature in many sectors, it had a high positive correlation in 'android east'. First we can look at the normal behavior in the 'chrome lounge sector':
And the unusual behavior at 'android east':
Turns out there is a highly unusual positive correlation because both sensors went to 0 at the same time.
That's the nature of real data. It's a hard job to keep it accurate and clean. CORR() is a tool that can also call our attention to what is going wrong.
We can go much further with this: What's the relationship between rooms? What sensors in which rooms can help us predict what will happen an hour later on a different room? For this and much more tune in for an upcoming Google Developers Live session where we'll discuss how we built this queries, and how to go beyond. Stay tuned!
-Posted by Felipe Hoffa, Developer Programs Engineer
0 comments:
Post a Comment