Cache
Sometimes you have to tune your pipeline of operations, but some steps are a bit... expensive. Sometimes an operation yields the very same result for a vast number of input records. This is where `cache` can probably help you out.
The manual page for `cache` is quite comprehensive, so there's little to add here apart from suggesting that you actually read it. It can be helpful to do a quick recap of the processing model, though:
- the basic mechanism for a cache is a key/value association: when the key is present, the associated value is used; otherwise it is computed and saved for future reuse. This is really it in our case;
- as a consequence, you need some way to determine the key and, when needed, also to compute the value;
- the key is either the input record itself, or derived according to the `key` element in the factory method;
- the value is computed from another tube (the cached one). This is a tube after all, so it is going to return record(s). For obvious reasons, a cache tube will never return an iterator, but only empty, or one single record, or `records` followed by an array reference with the output records inside.
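The key/value mechanism above can be sketched in plain Perl (no Data::Tubes involved; `cached_tube`, its `key` option and the in-memory hash are illustrative assumptions, not the module's actual implementation):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# wrap a tube-like sub with a cache: the key is derived from the input
# record, the value is whatever the wrapped sub returns
sub cached_tube {
   my ($tube, %args) = @_;
   my $key_for = $args{key} || sub { return $_[0] };   # default: record is the key
   my %store;                                          # in-memory repository
   return sub {
      my $record = shift;
      my $key    = $key_for->($record);
      $store{$key} = $tube->($record) unless exists $store{$key};
      return $store{$key};
   };
}

my $calls = 0;
my $slow  = sub { $calls++; return uc(shift) };   # pretend this is expensive
my $fast  = cached_tube($slow);

print $fast->('hello'), "\n";   # computed
print $fast->('hello'), "\n";   # served from the cache
print "calls: $calls\n";        # calls: 1
```

The real plugin adds the repository and serialization machinery on top of this idea, but the key/value core is the same.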
Suppose you want to create a little Markdown page thanking people in the perl6 organization on GitHub. There's an API for this! Let's see an example:
```perl
#!/usr/bin/env perl
# vim: sts=3 ts=3 sw=3 et ai :
use strict;
use warnings;
use 5.010;
use HTTP::Tiny;
use JSON::PP qw< decode_json >;
use Data::Tubes qw< pipeline >;

pipeline(
   \&all_pages_uris,
   \&get_uri,
   sub { return (records => decode_json(shift)); },
   sub { return {structured => shift}; },
   ['Renderer::with_template_perlish' => "Thank you, [% login %]!\n"],
   'Writer::to_files',
   {tap => 'sink'},
)->("https://api.github.com/orgs/perl6/public_members");

sub all_pages_uris {
   my $uri      = shift;
   my $response = HTTP::Tiny->new()->head($uri);
   die "error in HEAD($uri): $response->{status} $response->{reason}"
     unless $response->{success};
   my $headers      = $response->{headers};
   my @link_headers =
       !exists($headers->{link}) ? ()
     : ref($headers->{link})     ? (@{$headers->{link}})
     :                             ($headers->{link});
   my @uris = ($uri);
   for my $lh (@link_headers) {
      for my $link (split /,\s+/, $lh) {
         my ($uri) = $link =~ m{\A\s*<(.*)>;}mxs;
         push @uris, $uri;
      }
   }
   return (records => \@uris);
} ## end sub all_pages_uris

sub get_uri {
   state $ua    = HTTP::Tiny->new();
   my $uri      = shift;
   my $response = $ua->get($uri);
   die "error in GET($uri): $response->{status} $response->{reason}"
     unless $response->{success};
   return $response->{content};
}
```

Result:
```
shell$ ./cache-00
Thank you, Benabik!
Thank you, FROGGS!
Thank you, Heather!
Thank you, Takadonet!
Thank you, TimToady!
...
```
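The Link-header parsing inside `all_pages_uris` can be exercised on its own; the header value below is made up for illustration, not a real API response:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# same regular expression as in all_pages_uris, applied to a fabricated
# Link header with two pagination entries
my $link_header = join ', ',
   '<https://api.example.com/members?page=2>; rel="next"',
   '<https://api.example.com/members?page=3>; rel="last"';

my @uris;
for my $link (split /,\s+/, $link_header) {
   my ($uri) = $link =~ m{\A\s*<(.*)>;}mxs;
   push @uris, $uri if defined $uri;
}

print "$_\n" for @uris;
```

Each comma-separated entry carries its URI between angle brackets, which is exactly what the capture group extracts.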
The thank-you message is indeed pretty lame. It might be good to add some twist, like a randomized message for each of them, or a reminder about their id, or... Wait a minute! Should you hit the API every time you change something in the page layout? That wouldn't be fair to your bandwidth, or to the GitHub API!
We can initially concentrate on caching calls to `get_uri`, as it's where we get the bulk of the data (the initial HEAD request in `all_pages_uris` does not put too much data on the wire). Hence, the evolved example changes the initial one like this:
```perl
# instead of
#
#    \&get_uri,
#
# we put this:
[
   'Plumbing::cache' => \&get_uri,
   cache => ['!Data::Tubes::Util::Cache', repository => './'],
   key   => sub {
      (my $key = shift) =~ s{\W+}{-}gmxs;
      return 'cache-' . $key;
   },
],
```

We're setting up our cache on disk, so that we can reuse it across multiple invocations (the alternative is to set up a cache in memory as a hash, but that would make it go away when the process exits). The key is used as a filename here, so we do a bit of cleanup by removing unwanted characters and prefixing it with the string `cache-`, so that we avoid messing the directory up.
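The cleanup can be tried in isolation; `filename_key` is just an illustrative name for the same sub passed as `key` above:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# collapse any run of non-word characters into a single dash and
# prepend the "cache-" prefix, same as the key sub in the example
sub filename_key {
   (my $key = shift) =~ s{\W+}{-}gmxs;
   return 'cache-' . $key;
}

print filename_key('https://api.github.com/orgs/perl6/public_members'), "\n";
# cache-https-api-github-com-orgs-perl6-public_members
```

Every URI then maps to a single, filesystem-safe name, so each cached response gets its own file in the repository directory.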
We can of course extend caching to the initial HEAD request, so that we avoid hitting the network completely while we run our tests. One alternative is to put another cache around `all_pages_uris`, like in the following example:
```perl
pipeline(
   [
      'Plumbing::cache' => \&all_pages_uris,
      cache => ['!Data::Tubes::Util::Cache', repository => './'],
      key   => sub {
         (my $key = shift) =~ s{\W+}{-}gmxs;
         return 'cache-apu-' . $key;
      },
   ],
   [
      'Plumbing::cache' => \&get_uri,
      cache => ['!Data::Tubes::Util::Cache', repository => './'],
      key   => sub {
         (my $key = shift) =~ s{\W+}{-}gmxs;
         return 'cache-' . $key;
      },
   ],
   # ...
```

You might like this approach or not. On the one hand, it separates the caching at one step from the caching at another, which you might want for flexibility. On the other hand, in this case we're probably interested in caching all network interactions as a whole, so we can simplify it all like in the following example:
```perl
pipeline(
   [
      'Plumbing::cache' => pipeline(\&all_pages_uris, \&get_uri),
      cache => ['!Data::Tubes::Util::Cache', repository => './'],
      key   => sub {
         (my $key = shift) =~ s{\W+}{-}gmxs;
         return 'cache-net-' . $key;
      },
   ],
   # ...
```

We simply expanded the cached tube into a pipeline with all the network interactions, so they will be cached at the same time inside a `cache-net-...` file. This is easy and powerful at these sizes, but it can be regarded as the Achilles' heel of this solution when your inputs are much bigger, because:
- the whole stage is saved in a single file;
- before saving it, it is completely expanded in memory (to create all records that are saved in the file);
- memory consumption will also hit you when loading the cache back, of course.
So, be careful what you wish for! The layered caching approach still suffers from this, because it expands every iterator, but this should be a problem only in a restricted set of use cases.
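The trade-off can be sketched in plain Perl (no Data::Tubes involved; all names are made up): caching around the whole composition stores only the final result, so a cache hit skips every inner step at once.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my %hits;   # count how many times each step actually runs

my $fetch  = sub { $hits{fetch}++;  return "body-of-$_[0]" };
my $decode = sub { $hits{decode}++; return length $_[0]    };

# one cache around the whole composition: a single entry per input,
# covering both steps at once
my %cache;
my $pipeline = sub {
   my $uri = shift;
   $cache{$uri} //= $decode->($fetch->($uri));
   return $cache{$uri};
};

$pipeline->('a') for 1 .. 3;
print "fetch: $hits{fetch}, decode: $hits{decode}\n";   # fetch: 1, decode: 1
```

The flip side is that the single cache entry has to hold the fully expanded output of the composed steps, which is exactly the memory concern listed above.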
You can of course decide to move additional steps inside the cached pipeline, to save yourself some processing in subsequent invocations. In this case, you might want to move the two transforming subs inside and only leave the rendering and printing outside. It's really up to you.
One nice feature of `cache` is that it supports using CHI as the cache backend. Hence, you can leverage a plethora of modules available on CPAN, yay!
Using the cache is probably most useful as a temporary resource while you debug or tune your pipeline. There might be times when you spot a repeated computation, though, and you might benefit from caching in those cases too.