More on Grist, EE, periods in URLs

Ryan Irelan, the guy behind EE Insider and an overall reputable dude, has a post on his site summarizing an interesting recent thread (the same one I noted the other day) about our recent move from ExpressionEngine to WordPress on Grist.

Not-very-related aside:  What do you notice about the URL for the post?

http://eeinsider.com/blog/feedback-on-grist.org-move/

Hey! A period!  Today I noticed that this is illegal in WordPress, though EE permits it.  (It’s illegal in WordPress in the sense that requests for URLs with periods [and terminal dashes] are rewritten without those characters.)  In any case, this may be either uninteresting or obvious to others, but RFC 1738 is more liberal than I thought about which characters can appear unencoded in URLs:

Many URL schemes reserve certain characters for a special meaning:
their appearance in the scheme-specific part of the URL has a
designated semantics. If the character corresponding to an octet is
reserved in a scheme, the octet must be encoded. The characters “;”,
“/”, “?”, “:”, “@”, “=” and “&” are the characters which may be
reserved for special meaning within a scheme. No other characters may
be reserved within a scheme.

Thus, only alphanumerics, the special characters “$-_.+!*'(),”, and
reserved characters used for their reserved purposes may be used
unencoded within a URL.

So neither WP nor EE is wrong in the way it does this!
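
To make the character rules concrete, here is a minimal PHP sketch. Both function names are made up for illustration. The second function only mimics the rewriting behavior described above (periods dropped, terminal dashes trimmed) and is not WordPress’s actual code.

```php
<?php
// Returns true if $segment contains only characters that RFC 1738 allows
// unencoded: alphanumerics plus the special characters $-_.+!*'(),
function rfc1738_safe_segment(string $segment): bool
{
    return preg_match('/^[A-Za-z0-9$\-_.+!*\'(),]*$/', $segment) === 1;
}

// Hypothetical stand-in for the rewriting described above: drop periods,
// then trim dashes from the end of the slug.
function strip_periods_and_terminal_dashes(string $slug): string
{
    return rtrim(str_replace('.', '', $slug), '-');
}

$slug = 'feedback-on-grist.org-move';
var_dump(rfc1738_safe_segment($slug));               // bool(true)
echo strip_periods_and_terminal_dashes($slug), "\n"; // feedback-on-gristorg-move
```

The slug from the EE Insider URL passes the RFC check as-is, which is the point: the period is legal, and stripping it is a choice the CMS makes.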

Chartbeat API PHP Library Sketch

I wasn’t quite sure whether a PHP library existed for the awesome Chartbeat API (Chartbeat is a realtime analytics engine that we use at Grist, and one I’ve been playing around with just now), so I decided to just start writing one … So far I’ve only gotten as far as erecting the framework and the one call that I happen to need now, namely quickstats().

I might keep adding to this if I have time.

/*
 * Chartbeat Library
 *
 * requires cURL
 *
 * by Matt Perry
 */

class Chartbeat {

    public $apikey = '';
    public $host = '';
    public $chartbeat_timeout = 10; //seconds to wait for the API; arbitrary default, adjust to taste

    public function __construct($apikey, $host) {

        if (!$apikey || !$host) {
            throw new Exception('Missing API key or Host');
        }
        //for now we require cURL ... if it's not compiled into PHP, raise exception
        if (!function_exists('curl_init')) {
            throw new Exception('Curl not installed');
        }

        $this->apikey = $apikey;
        $this->host = $host;
    }

    //fetches quickstats for the host, optionally limited to a single path
    public function quickstats($path = '') {

        $url = 'http://api.chartbeat.com/live/quickstats/?host=' . urlencode($this->host)
            . '&apikey=' . urlencode($this->apikey);
        //assumes the API filters on a "path" query parameter
        if ($path) $url .= '&path=' . urlencode($path);
        return $this->_do_request($url);
    }

    //makes request, decodes json; returns null if the call failed or the body was not valid JSON
    private function _do_request($url) {
        $result = $this->_make_call($url);
        if ($result === false) return null;
        return json_decode($result);
    }

    //makes call with curl; returns the response body, or false on failure
    private function _make_call($url) {

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $this->chartbeat_timeout);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); //don't worry about cert problems

        $output = curl_exec($ch);
        curl_close($ch);

        return $output;
    }

}
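
As a side note on the URL construction in quickstats(), here is a small standalone sketch that builds the same URL with http_build_query(), so every value is URL-encoded in one place. The helper name chartbeat_quickstats_url() is hypothetical, and the "path" query parameter is an assumption about the API rather than something confirmed here.

```php
<?php
// Hypothetical helper: builds the quickstats URL with every value URL-encoded.
function chartbeat_quickstats_url(string $host, string $apikey, string $path = ''): string
{
    $params = ['host' => $host, 'apikey' => $apikey];
    if ($path !== '') {
        $params['path'] = $path; // assumption: the API filters by a "path" parameter
    }
    // Pass '&' explicitly so the separator does not depend on php.ini settings.
    return 'http://api.chartbeat.com/live/quickstats/?' . http_build_query($params, '', '&');
}

echo chartbeat_quickstats_url('grist.org', 'abc123', '/article/'), "\n";
// http://api.chartbeat.com/live/quickstats/?host=grist.org&apikey=abc123&path=%2Farticle%2F
```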

EE community reacts to Grist.org move and more

Here’s an interesting post and comment thread on EEInsider about Grist’s move to WordPress.  While parts of the thread are sort of lame (EE is better than WP, etc etc etc …) it’s touched on some interesting themes, like scaling, NoSQL and the organization of EE’s development team.  I also learned about CE Cache for EE2 … didn’t know that cool little add-on existed.

3 Things You Can Do on SOPA/PIPA Blackout Day

Tomorrow is SOPA/PIPA protest day.  The US entertainment industry wants to censor the internetz so that no one can download the Lion King anymore.  While it is excessive to want to do this in the first place, what’s worse is that the industry is seeking broad powers to force ISPs and web companies to regulate DNS and enforce draconian take-down orders.  If all that sounds like a horrible idea, it’s because all that is a horrible idea.

Here are three things you can do about it:

1.  Learn more about the blackout protest.  You can even add it to your own site.

2.  Watch this cool explainer video about the bill from Fight For the Future:

3.  (IMPORTANT)  Contact your representative — I just did — it was fun!

Grist Goes WordPress

We turned on the new WordPress version of our site today at Grist!  You can read about it here and here.  Perhaps now I can take a vacation.  Or, I can go to Indiana in January.  Don’t feel too bad for me though — a bright spring awaits.

In the meantime, check out this post about the growing popularity of WordPress.

New Digs

Like a disorganized garden of good intentions, my blogging activities have become a bit fragmented over the past year or so.  This blog is an attempt to fix that — to bring everything under one roof.  Whether it’s my tumblr, my work blog, or anything else, I’ll make a good faith effort to post links to anything I post anywhere on this blog, and also to post things here that don’t really belong in any of those other places.

The actual posts on this blog, such as they are, are taken from my old, retired blog at webdevnotebook.com (now defunct).

Other places you can find me on the interwebs:

  • twitter @mattoperry
  • my stkywll tumblr (mostly photography)
  • my work blog:  gristlabs.com

IronCache: Caching Expression Engine with memcached [UPDATED]

Update, Jan 2012:  Here’s an idea: ignore everything below and visit the IronCache page on GristLabs.  There, you can download IronCache rather than just reading about it!

I’m going to describe a general method of using memcached to implement whole-page caching for EE 1.6.x. It is somewhat complex, requires a small core change to EE and a fair amount of configuration, and is therefore not suitable for most EE users. However, it may be of interest to anyone running a large EE site or one that is subject to intense traffic spikes. This method was suggested to us by the folks at Automattic, makers of WordPress. It is based on the Batcache plugin for WordPress, which they use in a massive way to cache their blogs at wordpress.com.

The goal: We would like to use in-memory caching to store ENTIRE EE PAGES (not just templates) in order to quickly serve these pages in the event of a traffic spike. We would like this caching to kick in only when necessary (i.e., when a particular URL is subject to a spike) and only when the request is not associated with a logged-in session. We would like only certain types of pages to be cached (i.e., not the control panel or any other page we want to be free from caching).

Requirements: Before going any further, I’ll describe the infrastructure elements necessary for this. There are three:

1. A memcached server.
2. The PHP memcache extension.
3. A small change to EE’s core (see below).

General Method:

We will define two types of cache objects, each of which is associated with a single URL.

COUNT: This object tracks how many times a URL has been accessed in the last N (configurable) seconds.
DATA: This object stores the page content … it has a separate (also configurable) cache expiration time.

SESSIONS_END:

Very early in EE’s process (namely at the sessions_end hook) we check to see if several conditions are true:

- is this an anonymous session?
- does the requested URI match a defined list of patterns (configurable)?
- does a non-empty (non-expired) DATA cache object exist for this URL?

If all three of these conditions are met, EE immediately fetches the complete page content from the DATA cache object, sends appropriate headers, and then the content to the browser, and terminates. This means we have a complete page displayed with very very little effort from either PHP or the database.  Yay!

However, if there is a logged-in session, control is returned to the normal EE process with no changes, so logged-in users have the same experience as always.  Also, if the URI does not match one of the defined patterns, we proceed with vanilla EE, which prevents non-whitelisted pages from being cached.  In the case of an anonymous session and a cache-eligible URI but no (or an expired) DATA object, the COUNT object is consulted. If the count is at or above a configurable threshold, a flag is set in the $SESS->session_cache object so that the page will be cached later (more on this in a second).

Finally, whenever we encounter a cache eligible page, we increment the COUNT cache object associated with that URI.
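
The decision logic above can be sketched roughly as follows. This is not the extension’s code: ArrayCache is a hypothetical in-memory stand-in for memcached (expiry omitted for brevity), ironcache_decide() and the key names are invented, patterns are treated as plain substrings, and the counter is incremented before it is compared to the threshold. All of those are assumptions.

```php
<?php
// Hypothetical in-memory stand-in for memcached; expiry is omitted for brevity.
final class ArrayCache
{
    private $items = [];

    public function get(string $key)
    {
        return array_key_exists($key, $this->items) ? $this->items[$key] : false;
    }

    public function set(string $key, $value): void
    {
        $this->items[$key] = $value;
    }

    public function increment(string $key): int
    {
        $this->items[$key] = ($this->items[$key] ?? 0) + 1;
        return $this->items[$key];
    }
}

// Returns 'serve' (send the cached page and stop), 'flag' (let EE render the
// page, then cache the output), or 'pass' (plain EE, nothing cached).
function ironcache_decide(ArrayCache $cache, string $uri, bool $loggedIn, string $patterns, int $threshold): string
{
    if ($loggedIn) {
        return 'pass'; // logged-in users always get vanilla EE
    }

    // Assumption: patterns are substrings in a pipe-separated list.
    $eligible = false;
    foreach (explode('|', $patterns) as $pattern) {
        if ($pattern !== '' && strpos($uri, $pattern) !== false) {
            $eligible = true;
            break;
        }
    }
    if (!$eligible) {
        return 'pass';
    }

    // Every cache-eligible request bumps the COUNT object for its URI.
    $count = $cache->increment('count:' . $uri);

    if ($cache->get('data:' . $uri) !== false) {
        return 'serve';
    }
    return $count >= $threshold ? 'flag' : 'pass';
}

$cache = new ArrayCache();
echo ironcache_decide($cache, '/article/foo', false, 'article|feature', 2), "\n"; // pass
echo ironcache_decide($cache, '/article/foo', false, 'article|feature', 2), "\n"; // flag
```

With a threshold of 2, the first anonymous hit passes through and the second one is flagged for caching.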

OUTPUT_START:
note: this bit requires a small EE core change … see below.

The cache is populated at the beginning of the core.output class, just before the regular headers are sent. At this point, our extension checks for the flag in $SESS->session_cache. If it’s present, it stores the page it is about to send to the browser in the DATA object for that URI.  The result, of course, is that the next request for that URI will have this cache available (subject to expiration) at sessions_end.
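
A rough sketch of this store-then-serve step, under stated assumptions: the function names, the 'ironcache_flag' key and the array-based store are all hypothetical stand-ins (the real extension talks to memcached, which handles expiry itself), and the current time is passed in so that expiry is easy to reason about.

```php
<?php
// Hypothetical output_start step: if sessions_end flagged this request,
// store the final page in the DATA entry for the URI.
// $store stands in for memcached; $sessionCache stands in for $SESS->session_cache.
function ironcache_store_page(array &$store, array $sessionCache, string $uri, string $html, int $ttl, int $now): bool
{
    if (empty($sessionCache['ironcache_flag'])) {
        return false; // not flagged: send the page without caching it
    }
    $store['data:' . $uri] = ['html' => $html, 'expires' => $now + $ttl];
    return true;
}

// Hypothetical sessions_end lookup: returns the cached page, or false if
// there is no entry or the entry has expired.
function ironcache_fetch_page(array $store, string $uri, int $now)
{
    $entry = $store['data:' . $uri] ?? null;
    if ($entry === null || $entry['expires'] <= $now) {
        return false;
    }
    return $entry['html'];
}

$store = [];
ironcache_store_page($store, ['ironcache_flag' => true], '/article/foo', '<html>...</html>', 300, 1000);
var_dump(ironcache_fetch_page($store, '/article/foo', 1100)); // string(16) "<html>...</html>"
var_dump(ironcache_fetch_page($store, '/article/foo', 1300)); // bool(false)
```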

Components and Configuration:

This method is implemented in an extension called IronCache, written by our team here at Grist.  A number of configuration values are required:

$conf['ironcache_enable_cache'] = 'y'; // 'y' means the cache is on; anything else, like 'n', means it's off
$conf['ironcache_cache_time'] = '300'; // page cache lifetime, in seconds
$conf['ironcache_counter_reset'] = '10'; // number of seconds it takes for the hit counter to reset
$conf['ironcache_threshold'] = '2'; // if a page gets this many hits within one counter period, it is cached
$conf['ironcache_patterns'] = 'pattern1|pattern2|pattern3'; // a pipe-separated list of patterns for detecting cache-eligible pages
$conf['ironcache_cache_homepage'] = 'y'; // whether or not to cache '/'
$conf['ironcache_prefix'] = 'dev'; // prefix added to cache keys; lets several applications share one memcached server without name collisions
$conf['ironcache_memcache_host'] = 'localhost'; // memcached server host
$conf['ironcache_memcache_port'] = '11211'; // memcached server port
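
To illustrate what the prefix is for, here is one way cache keys might be built. This is a sketch, not IronCache’s actual key scheme: ironcache_key() and the "prefix:type:hash" layout are assumptions. Hashing the URI keeps keys short and free of whitespace, which matters because memcached keys are limited to 250 bytes and may not contain spaces or control characters.

```php
<?php
// Hypothetical key builder: "<prefix>:<type>:<md5 of the URI>".
// $type would be something like 'count' or 'data'.
function ironcache_key(string $prefix, string $type, string $uri): string
{
    return $prefix . ':' . $type . ':' . md5($uri);
}

// Two applications sharing one memcached server get distinct keys
// for the same URI because their prefixes differ.
var_dump(ironcache_key('dev', 'data', '/article/foo') === ironcache_key('live', 'data', '/article/foo')); // bool(false)
```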

Core Change:

A small core change is required.  At the beginning of the display_final_output method in core.output.php, add the following code:

// -------------------------------------------
// 'output_start' hook.
//  - override output behavior
//  - implement whole-page output caching
//
$edata = $EXT->universal_call_extension('output_start', $this);
if ($EXT->end_script === TRUE) return;
//
// -------------------------------------------

If anyone has any suggestions about how to avoid this core change, I’d love to hear them!

The Extension:

I’m planning to post a copy of the actual extension here sometime in the coming week … if you’d like a copy before then (minus some cleanup) I’d be fine emailing it to you … matt0perry [att] gmail [d0t] c0m

Ideas?  Comments?  Suggestions for Improvement?

Comment away.

Will Facebook Connect Liberalize Data Retention Terms?

TechCrunch re-rumored the possibility of changes to Facebook Connect, including changes to FB’s data retention terms for developers, which (as the post points out) are a) onerous and b) unenforceable.

Facebook Connect’s current terms of service prevent any third-party application from storing data obtained through the API for more than 24 hours. And what is available through the API? If you authorize a third-party application using FBC, then basically anything in your profile is available to that application or site, including things like your avatar, your friend network and lots of biographical data. Notably, your email address is not always directly available: developers only get access to a proxy unless you explicitly grant otherwise. In any case, all of this is cacheable for only 24 hours. After that you have to refresh the data from Facebook directly or delete it.

If this were to change, there would be much rejoicing among FB developers. Lots of effort is spent making sure that data is properly refreshed.   Many would also like to see default direct access to the person’s real email address.
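
For a sense of what that refresh bookkeeping looks like, here is a minimal sketch of a staleness check. The constant and function names are hypothetical, and treating data as stale at exactly 24 hours is a deliberately conservative reading of the terms.

```php
<?php
const FBC_MAX_CACHE_SECONDS = 86400; // 24 hours, per the terms described above

// Hypothetical check: true once cached Facebook data must be refreshed or deleted.
// $fetchedAt and $now are Unix timestamps.
function fbc_data_is_stale(int $fetchedAt, int $now): bool
{
    return ($now - $fetchedAt) >= FBC_MAX_CACHE_SECONDS;
}

$fetchedAt = 1000000;
var_dump(fbc_data_is_stale($fetchedAt, $fetchedAt + 3600));  // bool(false)
var_dump(fbc_data_is_stale($fetchedAt, $fetchedAt + 86400)); // bool(true)
```

Every cached profile field needs a timestamp like $fetchedAt stored next to it, which is a good part of why the 24-hour rule is such a chore.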

It’s difficult to background-spoof a local account using unstable data (which is essentially what most Facebook Connect website integrations attempt to do), especially since a real email address is not always available and, moreover, cannot be relied upon for more than 24 hours. This has caused many sites (see, for example, Huffington Post Social News) to ask for additional information directly from the user (and sometimes a lot of it) at the time someone connects using FB. That requirement vastly reduces FBC’s utility as an SSO technology, but it allows the site to permanently retain the data, since it did not originate from FB.

So here, O Facebook, is my wish-list: a real email address by default, and the ability to permanently cache certain data as long as the user remains connected to my app. What else should be on my wish list?