Cache
Sometimes you have to tune your pipeline of operations, but some steps are a bit... expensive. Sometimes an operation yields the very same result for a vast number of input records. This is where `cache` can probably help you out.
The manual page for `cache` is quite comprehensive, so there's little to add here apart from suggesting that you actually read it. It can be helpful to do a quick recap of the processing model, though:
- the basic mechanism for a cache is a key/value association: when the key is present, the associated value is used; otherwise it is computed and saved for future reuse. This is really it in our case;
- as a consequence, you need some way to determine the key and, when needed, also to compute the value;
- the key is either the input record itself, or derived according to the `key` element in the factory method;
- the value is computed from another tube (the cached one). This is a tube after all, so it is going to return record(s). For obvious reasons, a cache tube will never return an iterator, but only empty, or one single record, or `records` followed by an array reference with the output records inside.
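The key/value mechanism above can be sketched in plain Perl (no Data::Tubes involved; `cached_tube`, its `key` option and the in-memory hash are illustrative assumptions, not the module's actual implementation):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# wrap a tube-like sub with a cache: the key is derived from the input
# record, the value is whatever the wrapped sub returns
sub cached_tube {
   my ($tube, %args) = @_;
   my $key_for = $args{key} || sub { return $_[0] };   # default: record is the key
   my %store;                                          # in-memory repository
   return sub {
      my $record = shift;
      my $key    = $key_for->($record);
      $store{$key} = $tube->($record) unless exists $store{$key};
      return $store{$key};
   };
}

my $calls = 0;
my $slow  = sub { $calls++; return uc(shift) };   # pretend this is expensive
my $fast  = cached_tube($slow);

print $fast->('hello'), "\n";   # computed
print $fast->('hello'), "\n";   # served from the cache
print "calls: $calls\n";        # calls: 1
```

The real plugin adds the repository and serialization machinery on top of this idea, but the key/value core is the same.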
Suppose you want to create a little Markdown page thanking people in the perl6 organization on GitHub. There's an API for this! Let's see an example:
```perl
#!/usr/bin/env perl
# vim: sts=3 ts=3 sw=3 et ai :
use strict;
use warnings;
use 5.010;
use HTTP::Tiny;
use JSON::PP qw< decode_json >;
use Data::Tubes qw< pipeline >;

pipeline(
   \&all_pages_uris,
   \&get_uri,
   sub { return (records => decode_json(shift)); },
   sub { return {structured => shift}; },
   ['Renderer::with_template_perlish' => "Thank you, [% login %]!\n"],
   'Writer::to_files',
   {tap => 'sink'},
)->("https://api.github.com/orgs/perl6/public_members");

sub all_pages_uris {
   my $uri      = shift;
   my $response = HTTP::Tiny->new()->head($uri);
   die "error in HEAD($uri): $response->{status} $response->{reason}"
     unless $response->{success};
   my $headers      = $response->{headers};
   my @link_headers =
       !exists($headers->{link}) ? ()
     : ref($headers->{link})     ? (@{$headers->{link}})
     :                             ($headers->{link});
   my @uris = ($uri);
   for my $lh (@link_headers) {
      for my $link (split /,\s+/, $lh) {
         my ($uri) = $link =~ m{\A\s*<(.*)>;}mxs;
         push @uris, $uri;
      }
   }
   return (records => \@uris);
} ## end sub all_pages_uris

sub get_uri {
   state $ua    = HTTP::Tiny->new();
   my $uri      = shift;
   my $response = $ua->get($uri);
   die "error in GET($uri): $response->{status} $response->{reason}"
     unless $response->{success};
   return $response->{content};
}
```

Result:
```
shell$ ./cache-00
Thank you, Benabik!
Thank you, FROGGS!
Thank you, Heather!
Thank you, Takadonet!
Thank you, TimToady!
...
```
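The Link-header parsing inside `all_pages_uris` can be exercised on its own; the header value below is made up for illustration, not a real API response:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# same regular expression as in all_pages_uris, applied to a fabricated
# Link header with two pagination entries
my $link_header = join ', ',
   '<https://api.example.com/members?page=2>; rel="next"',
   '<https://api.example.com/members?page=3>; rel="last"';

my @uris;
for my $link (split /,\s+/, $link_header) {
   my ($uri) = $link =~ m{\A\s*<(.*)>;}mxs;
   push @uris, $uri if defined $uri;
}

print "$_\n" for @uris;
```

Each comma-separated entry carries its URI between angle brackets, which is exactly what the capture group extracts.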
The thank-you message is indeed pretty lame. It might be good to add some twist, like a randomized message for each of them, or a reminder about their id, or... Wait a minute! Should you hit the API every time you change something in the page layout? That wouldn't be fair to your bandwidth, or to the GitHub API!
We can initially concentrate on caching calls to `get_uri`, as it's where we get the bulk of the data (the initial HEAD request in `all_pages_uris` does not put too much data on the wire). Hence, the evolved example changes the initial one like this:
```perl
# instead of
#
#    \&get_uri,
#
# we put this:
[
   'Plumbing::cache' => \&get_uri,
   cache => ['!Data::Tubes::Util::Cache', repository => './'],
   key   => sub {
      (my $key = shift) =~ s{\W+}{-}gmxs;
      return 'cache-' . $key;
   },
],
```

We're setting up our cache on disk, so that we can reuse it across multiple invocations (the alternative is to set up a cache in memory as a hash, but that would make it go away when the process exits). The key is used as a filename here, so we do a bit of cleanup by removing unwanted characters and prefixing it with the string `cache-`, so that we avoid messing the directory up.
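The cleanup can be tried in isolation; `filename_key` is just an illustrative name for the same sub passed as `key` above:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# collapse any run of non-word characters into a single dash and
# prepend the "cache-" prefix, same as the key sub in the example
sub filename_key {
   (my $key = shift) =~ s{\W+}{-}gmxs;
   return 'cache-' . $key;
}

print filename_key('https://api.github.com/orgs/perl6/public_members'), "\n";
# cache-https-api-github-com-orgs-perl6-public_members
```

Every URI then maps to a single, filesystem-safe name, so each cached response gets its own file in the repository directory.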
We can of course extend caching to the initial HEAD request, so that we avoid hitting the network completely while we run our tests. One alternative is to put another cache around `all_pages_uris`, like in the following example:
```perl
pipeline(
   [
      'Plumbing::cache' => \&all_pages_uris,
      cache => ['!Data::Tubes::Util::Cache', repository => './'],
      key   => sub {
         (my $key = shift) =~ s{\W+}{-}gmxs;
         return 'cache-apu-' . $key;
      },
   ],
   [
      'Plumbing::cache' => \&get_uri,
      cache => ['!Data::Tubes::Util::Cache', repository => './'],
      key   => sub {
         (my $key = shift) =~ s{\W+}{-}gmxs;
         return 'cache-' . $key;
      },
   ],
   # ...
```

You might like this approach or not. On the one hand, it separates the caching at one step from the caching at another, which you might want for flexibility. On the other hand, in this case we're probably interested in caching all network interactions as a whole, so we can simplify it all like in the following example:
```perl
pipeline(
   [
      'Plumbing::cache' => pipeline(\&all_pages_uris, \&get_uri),
      cache => ['!Data::Tubes::Util::Cache', repository => './'],
      key   => sub {
         (my $key = shift) =~ s{\W+}{-}gmxs;
         return 'cache-net-' . $key;
      },
   ],
   # ...
```

We simply expanded the cached tube into a pipeline with all the network interactions, so they will be cached at the same time inside a `cache-net-...` file. This is easy and powerful at these sizes, but it can be regarded as the Achilles' heel of this solution when your inputs are much bigger, because:
- the whole stage is saved in a single file;
- before saving it, it is completely expanded in memory (to create all records that are saved in the file);
- memory consumption will also hit you when loading the cache back, of course.
So, be careful what you wish for! The layered caching approach still suffers from this, because it expands every iterator, but this should be a problem only in a restricted set of use cases.
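The trade-off can be sketched in plain Perl (no Data::Tubes involved; all names are made up): caching around the whole composition stores only the final result, so a cache hit skips every inner step at once.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my %hits;   # count how many times each step actually runs

my $fetch  = sub { $hits{fetch}++;  return "body-of-$_[0]" };
my $decode = sub { $hits{decode}++; return length $_[0]    };

# one cache around the whole composition: a single entry per input,
# covering both steps at once
my %cache;
my $pipeline = sub {
   my $uri = shift;
   $cache{$uri} //= $decode->($fetch->($uri));
   return $cache{$uri};
};

$pipeline->('a') for 1 .. 3;
print "fetch: $hits{fetch}, decode: $hits{decode}\n";   # fetch: 1, decode: 1
```

The flip side is that the single cache entry has to hold the fully expanded output of the composed steps, which is exactly the memory concern listed above.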
You can of course decide to move additional steps inside the cached pipeline, to save yourself some processing in subsequent invocations. In this case, you might want to move the two transforming subs inside and only leave the rendering and printing outside. It's really up to you.
One nice feature of `cache` is that it supports using CHI as the cache backend. Hence, you can leverage a plethora of modules available on CPAN, yay!
Using the cache is probably most useful as a temporary resource while you debug or tune your pipeline. There might be times when you spot a repeated computation, though, and you might benefit from caching in those cases too.