Plotting Arduino Temperature Logs #2 - Using Highcharts, d3, and Queue.js for asynchronous data merging and plotting
So in my previous post on building an Arduino weather station, I'd collected roughly half a million data points of minute-by-minute temperature data over 2015 and plotted a day's worth with d3. This time I decided to use Highcharts.js to display the entire dataset at dynamic resolutions of minutes, hours, and days. Click here or on the image for a live demo.
Demo of a dynamic Highchart showing 400,000 data points via asynchronous loading.
How do we do this? We could plot all 400,000 points at once, but that would be a bad idea for a few reasons. First, most screens don't have 400,000 pixels, so we couldn't see all the points anyway. Second, humans don't really parse 400,000 points of data; we generalize. Finally, sending all 400,000 points at once is a lot of data to throw at the network and our visualization system.
So what can we do? A simple solution is to average and sample the data at larger increments. We can take the minute data and average it into hourly data, with the added bonus of tracking the min and max values over each hour. Similarly, we can generate other groupings such as daily or even weekly.
Generate hourly and daily grouped CSV data with Python
Let's do that first. In Python, we write a script that runs through all the temperature logs (1,440 entries per file, one file per day, around 270 days' worth so far). It converts the logs into a CSV format that d3.js can easily parse, and also generates the aggregate group data and saves it to files. Since hourly data for a year is only about 8,760 entries, it's small enough that we can just store it in one file for simplicity. Here is some stripped-down code showing how we load each file, parse the data, and build up our hourly and daily aggregates.
```python
# For each file (one file per day of minute data)
for input_filepath in input_file_list:
    # Writing minute data to daily csv
    with open(csv_filepath, 'w') as f_csv:
        # For each line in input file
        for line in open(input_filepath, 'r'):
            linedata = line.strip().split('\t')
            timestamp = formatDate(linedata[0])
            linedata[0] = timestamp.isoformat()
            f_csv.write("%s\n" % (",".join(linedata)))

            # Add to hourly bin (min, sum, max per hour)
            temp = float(linedata[2])
            data_hourly[timestamp.hour, 0] = min(temp, data_hourly[timestamp.hour, 0])
            data_hourly[timestamp.hour, 1] += temp
            data_hourly[timestamp.hour, 2] = max(temp, data_hourly[timestamp.hour, 2])
            data_hourly_cnt[timestamp.hour] += 1

    # Divide sums by counts to get the hourly means
    data_hourly[data_hourly_cnt != 0, 1] /= data_hourly_cnt[data_hourly_cnt != 0]
    data_hourly[data_hourly_cnt == 0, 1] = 0

    # Write hourly data to file
    for hour in range(24):
        msg = "%s,%.2f,%.2f,%.2f\n" % ((file_date + timedelta(hours=hour)).isoformat(),
                                       data_hourly[hour, 0],
                                       data_hourly[hour, 1],
                                       data_hourly[hour, 2])
        f_hourly.write(msg)

    # Write daily summary to file
    msg = "%s,%.2f,%.2f,%.2f\n" % (file_date.isoformat(),
                                   data_hourly[:, 0].min(0),
                                   data_hourly[:, 1].mean(0),
                                   data_hourly[:, 2].max(0))
    f_daily.write(msg)
```
The key part in there is generating the hourly data: we accumulate the min, sum, and max over every minute within the hour:
```python
data_hourly[timestamp.hour, 0] = min(temp, data_hourly[timestamp.hour, 0])
data_hourly[timestamp.hour, 1] += temp
data_hourly[timestamp.hour, 2] = max(temp, data_hourly[timestamp.hour, 2])
data_hourly_cnt[timestamp.hour] += 1
```
And then, once we've collected all the points for the hour, we divide by the counts to get our averages:
```python
# Divide sums by counts to get the means
data_hourly[data_hourly_cnt != 0, 1] /= data_hourly_cnt[data_hourly_cnt != 0]
data_hourly[data_hourly_cnt == 0, 1] = 0
```
Okay, so now that we've got our aggregate groups, we need our chart to pull from the right data source. As a first pass we could just be careful and choose ranges that only span a few days at most, but a better idea is to check the range: if it's beyond a certain threshold (say, more than N entries) we pull from the hourly aggregate CSV instead, and similarly for daily.
Loading arbitrary time ranges
Yeah, okay, that sounds reasonable, let's get to it. But how do we deal with multiple sources of data? Not only do we need to decide the right resolution (minute/hour/day), but the data we need for a given datetime range might be spread among several files. To get the right data, we need a function that, given a date range, can choose the right resolution, find the full set of files needed, merge and clip the data to the right datetime range, and return that. We can start by building a wrapper that decides the appropriate resolution:

```js
/* Wrapper to get data in a time range; determines resolution first.
 * Data resolution is either minutes, hours, or days, depending on the range,
 * such that the number of points stays below pointThreshold.
 * set_data_fn is the callback passed the data that is returned. */
function getTempsInRange(date_min, date_max, set_data_fn) {
    console.log("Asked for range", new Date(date_min) + " <--> " + new Date(date_max));

    var diffMs = date_max - date_min;        // Delta in milliseconds
    var diffMinutes = diffMs / (1000 * 60);  // Delta in minutes
    var diffHours = diffMinutes / 60;        // Delta in hours
    console.log((diffHours / 24).toFixed(2) + " days,",
                diffMinutes.toFixed(2) + " minutes,",
                diffHours.toFixed(2) + " hours in range");

    // Determine resolution
    var resolution;
    if (diffHours > pointThreshold) {
        resolution = DATA_RESOLUTION.day;
        console.log("Resolution : day");
    } else if (diffMinutes > pointThreshold) {
        resolution = DATA_RESOLUTION.hour;
        console.log("Resolution : hour");
    } else {
        resolution = DATA_RESOLUTION.minute;
        console.log("Resolution : minute");
    }

    getTempsInRange_(date_min, date_max, resolution, set_data_fn);
}
```
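As an aside, here's a minimal sketch of how a wrapper like this might be wired into Highcharts; the container id, series setup, and loading messages are my assumptions for illustration, not the demo's exact code. Highcharts fires afterSetExtremes whenever the user zooms or pans, which is a natural place to refetch at the right resolution:

```js
// A hedged sketch, not the demo's exact wiring: refetch on zoom/pan
// via the xAxis afterSetExtremes event and swap the data into the series.
var chart = new Highcharts.StockChart({
    chart: { renderTo: 'container', zoomType: 'x' },  // assumed container id
    xAxis: {
        events: {
            afterSetExtremes: function (e) {
                var chart = this.chart;           // `this` is the axis here
                chart.showLoading('Loading data...');
                getTempsInRange(e.min, e.max, function (data) {
                    chart.series[0].setData(data);
                    chart.hideLoading();
                });
            }
        }
    },
    series: [{ name: 'Temperature', data: [] }]
});
```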
Then, given a range and resolution, the simplest cases are hourly and daily data, since each is stored in a single CSV that we can pull directly. We then return a sliced subset of the data in the requested datetime range:
```js
if (resolution == DATA_RESOLUTION.day) {
    timeDelta = 1000 * 60 * 60 * 24;
    console.log("Loading daily.");
    // Just use all the daily data since it's only 274 values
    d3.csv('../csv/R2015_daily.csv', d3_callback);
} else if (resolution == DATA_RESOLUTION.hour) {
    timeDelta = 1000 * 60 * 60;
    console.log("Loading hourly.");
    d3.csv('../csv/R2015_hourly.csv', function(dataAll) {
        // Gets all hourly data, slice it up first
        var min_index = bisectDateLow(dataAll, date_min);
        var max_index = bisectDateHigh(dataAll, date_max);
        var data = dataAll.slice(min_index, max_index);
        console.log("Returning", data.length, "hourly data points");
        d3_callback(data);
    });
}
```
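The bisectDateLow and bisectDateHigh helpers aren't shown above; here's a minimal sketch of how they could be built with d3.bisector, assuming each row has already had its date field parsed into a JavaScript Date (the field name is an assumption):

```js
// A minimal sketch, assuming each row carries a parsed "date" field;
// d3.bisector gives us binary search over the date-sorted rows.
var dateBisector = d3.bisector(function(d) { return d.date; });

function bisectDateLow(data, date) {
    // Index of the first row at or after `date`
    return dateBisector.left(data, new Date(date));
}

function bisectDateHigh(data, date) {
    // Index just past the last row at or before `date`
    return dateBisector.right(data, new Date(date));
}
```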
The tricky part comes when our resolution is in minutes and our datetime range spans multiple days, say from Monday afternoon to Wednesday morning. The data needed is spread across three files. It seems simple enough to just load all three files and merge them together, but loading CSV data over the network is asynchronous, and we need to wait for all three (or however many) requests to finish before we can merge the data.
Luckily Queue.js (thanks mbostock!) provides exactly this ability to defer an operation until all other operations have finished, the operations in this case being several d3.csv calls to load data.
```js
else if (resolution == DATA_RESOLUTION.minute) {
    console.log("Loading by the minute.");
    loadCombinedMinuteCSVs(date_min, date_max, set_data_fn);
}
```
The magic happens in loadCombinedMinuteCSVs, which is a bit too big to show in full here, so I've linked it. The key deferred loading looks like this:
```js
for (var cdate = sdate; cdate <= edate; cdate += oneday) {
    var currfilename = "../csv/R" + (new Date(cdate)).toString("yyyy_MM_dd") + ".csv";
    q.defer(d3.csv, currfilename);
}
q.awaitAll(combine);
```
After that, d3.merge can be used to combine the datasets into one large one. Before or after that, we can also use d3.bisect on the dates to find the indices of the start and end points to the minute, and slice the arrays to get the exact datetime range requested. Merging with d3 is anticlimactically simple: `dataset = d3.merge(allData);`
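To make the flow concrete, here's a hedged sketch of how the queue and its combine callback fit together; the error handling and handoff details are my assumptions, not the linked function's exact code:

```js
// A hedged sketch of the surrounding plumbing, not the exact linked code.
var q = queue();  // from Queue.js

// ... one q.defer(d3.csv, filename) per daily file, as in the loop above ...

q.awaitAll(function combine(error, allData) {
    if (error) throw error;
    // allData is an array of row arrays, one per file, in defer order
    var dataset = d3.merge(allData);
    // Trim the merged rows to the exact requested range, then hand off
    var data = dataset.slice(bisectDateLow(dataset, date_min),
                             bisectDateHigh(dataset, date_max));
    set_data_fn(data);
});
```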
After all that, Highcharts beautifully handles all the dynamic slicing of data in a visually appealing way. There were some small gotchas handling NaN data (a sketch of one fix is below), but eventually we end up with our live final demo.
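For the NaN gotchas, one approach (an assumption for illustration, not necessarily what the demo does) is to map unparsable temperatures to null, which Highcharts renders as gaps in the line rather than broken plots:

```js
// A minimal sketch, assuming a "temp" field per row (field name is an
// assumption): Highcharts treats null y-values as gaps in the series.
var seriesData = data.map(function(d) {
    var temp = parseFloat(d.temp);
    return [Date.parse(d.date),          // x: timestamp in ms
            isNaN(temp) ? null : temp];  // y: null renders as a gap
});
```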
Source Code on GitHub