// Notes on Cloud Computing, Distributed Systems, SDI (System Design
// Interviews) and all the High Scalability, Web Scaling talks from YouTube
@@@@@@@@@@@@@ Cloud Computing Concepts Part 1 @@@@@@@@@@@@@@
Week 1
Topics
* Intro to Cloud Computing (history, challenges, economics etc)
* MapReduce
- AWS
* EC2: Elastic Compute Cloud
* S3: Simple Storage Service
* EBS: Elastic Block Storage
- 4 Features New in Today's Cloud
* Massive Scale
* On-Demand Access: Pay as you go model.
* Data intensive nature
* New cloud programming paradigms
- Energy usage:
WUE (Water Usage Effectiveness) = Annual Water Usage/IT Equipment Energy
(L/kWh) - lower is better.
PUE (Power Usage Effectiveness) = Total Facility Power/IT Equipment Power
- lower is better (1.0 is ideal).
- MapReduce:
* Map
* Reduce
* Resource Manager (for Hadoop, YARN is the resource mgr)
- YARN Scheduler = Yet Another Resource Negotiator
+ Treats each server as collection of 'containers'
~ Container = CPU + Memory (not same as Docker container)
+ 3 main components
~ Global Resource Manager (RM) for scheduling
~ Per-server Node Manager (NM) for server specific functions
~ Per-application Application Master (AM).
* Responsible for negotiation with RM and NM
* Detecting task failures of that job
- MapReduce Fault Tolerance
+ Heartbeats are periodically sent across managers for health checks.
+ Stragglers: Slow execution.
~ Speculative Execution: If % executed is very low, a replicate
execution is started on a new VM/container.
- Building blocks of Distributed Systems
+ Gossip and Epidemic Protocols
+ Failure Detection and Membership Protocols
- Grid computing is pre-cursor to cloud computing.
Week 2
Topics
* Gossip & Epidemic protocols
* Group Membership, its applications
Multicast Problem
---
- Multicast protocol must be fault-tolerant and scalable.
- IP multicast is not attractive because it MAY not be implemented in
underlying routers/servers.
- This is application level multicast protocol.
How to solve?
---
Centralized: Sender has a list of destination node. Opens socket and sends
TCP/UDP packets.
- Problem?
* fault-tolerance: It is not.
* more latency (longer time)
Tree based multicast protocols:
+ eg., IP multicast, SRM, RMTP, TRAM, TMTP
- Problems? If nodes fail, re-create the tree.
- Spanning trees are built among nodes. This is used to disseminate messages.
- Use ACKs or NACKs to repair unreceived multicast packets.
SRM (Scalable Reliable Multicast):
+ Uses NACKs
+ Uses random delays and exponential backoff to avoid NACK storms
RMTP (Reliable Multicast Transport Protocol):
+ Uses ACKs
+ ACKs sent to designated receivers, which then re-transmit missing
multicasts.
- These protocols still cause O(N) ACK/NACK overhead.
Gossip Protocol (Epidemic Multicast)
---
- Sender randomly picks 'b' random targets and sends Gossip Messages.
- 'b' is called Gossip fan out. Typically 2 or so targets.
- When nodes receive the Gossip, it is 'infected' by Gossip.
- Receiver starts doing the same.
- There may be duplicate received by nodes.
- "Infected Nodes" - those that received multicast message.
- "Uninfected Nodes" - those that did NOT receive multicast message.
Push vs Pull
---
- Gossip protocol is a push protocol.
- If multiple multicast messages exist, gossip a subset of them.
- There's a "Pull" gossip.
+ Periodically poll a few randomly selected targets for new multicast
messages that you haven't received.
- There's a hybrid variant too: push-pull.
Gossip Analysis
---
- Push based Gossip protocol is
+ lightweight in large groups.
+ spreads multicast quickly.
+ is highly fault-tolerant.
- Analysis is based on mathematical branch of Epidemiology.
+ Population of n+1 individuals mixing homogeneously.
+ Contact rate between any individual pair is B
- TODO: Watch videos for analysis.
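The O(log N) spread claim can be eyeballed with a toy simulation (a sketch with assumed parameters, not the course's analysis):

```python
import random

def push_gossip_rounds(n, fanout=2, seed=0):
    """Simulate push gossip: every infected node forwards the multicast
    to `fanout` uniformly random targets each round. Returns the number
    of rounds until all n nodes are infected."""
    rng = random.Random(seed)
    infected = {0}              # node 0 is the original sender
    rounds = 0
    while len(infected) < n:
        newly = {rng.randrange(n) for node in infected for _ in range(fanout)}
        infected |= newly
        rounds += 1
    return rounds

# Spread is fast: roughly logarithmic in group size
print(push_gossip_rounds(1000, fanout=2))
```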
Group Membership
---
- Failures are the norm in datacenters.
- Say the MTTF of one machine is 10 years (120 months). In a DC with 120
servers, the cluster-wide Mean Time To Failure (MTTF) is 1 month. MTTF
shrinks as more servers are added.
- Two Sub Protocols are essential in Group Membership.
+ Failure Detection
+ Info Dissemination
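The MTTF arithmetic above can be checked directly (numbers are from the example; cluster-wide MTTF scales as 1/N for N independently failing machines):

```python
# One machine fails on average every 10 years = 120 months; with 120
# independent machines, expect roughly one failure somewhere each month.
mttf_machine_months = 120
servers = 120
mttf_cluster_months = mttf_machine_months / servers
print(mttf_cluster_months)  # 1.0
```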
Failure Detectors
---
- Frequency of failures goes up linearly with size of DC.
- Properties of Failure Detectors
+ Completeness - each failure is detected.
+ Accuracy - no mistaken detection.
+ Speed - time to first detection of failure.
+ Scale - load balance each member.
- Completeness + Accuracy = Very hard to achieve together. Equivalent to the
consensus problem in distributed systems.
- What *really* happens?
+ Completeness - Guaranteed
+ Accuracy - partial/probabilistic guarantee.
Failure Detector Protocols
Centralized Heartbeating.
---
- A process receives heartbeat from all other processes. After timeout, a
process is marked failed.
- A process can be overloaded with heartbeats. Hotspot problem.
Ring Heartbeating
---
- Form a ring. Processes send/receive heartbeats to its neighbors.
- Multiple simultaneous failures cause problems.
All-to-All Heartbeating
---
- All processes send to all other processes.
- Has equal load per member.
- Protocol is complete.
- Too many messages though.
Gossip-Style Membership
---
- Nodes maintain a membership list with 3 values:
+ IDs of Nodes.
+ Heartbeat counter. A counter unique to a node.
+ Local time when heartbeat was received from a node.
- Each node randomly shares this membership list with peers. Peers can update
their membership list based on this message.
- If no heartbeat increase is seen for a node within a timeout T_fail, the
node is marked failed.
- After a further wait of T_cleanup (often equal to T_fail), the entry is
deleted.
- The extra T_cleanup wait avoids "ghost" entries: membership lists gossiped
by nodes that have not yet detected the failure are ignored for that entry
during this window, so the deleted entry is not immediately re-added.
- Only a heartbeat received *directly* from the node adds the entry back to
the membership table.
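A minimal sketch of the merge rule (hypothetical structure, node_id -> (heartbeat_counter, local_time_of_last_update); not the lecture's code):

```python
T_FAIL = 10.0  # seconds without heartbeat progress before suspecting a node

def merge(local, received, now):
    """Merge a received membership list into the local one: adopt any
    entry whose heartbeat counter is higher, stamped with *local* time."""
    for node, (hb, _their_time) in received.items():
        if node not in local or hb > local[node][0]:
            local[node] = (hb, now)
    return local

def suspects(local, now):
    """Nodes whose heartbeat has not advanced within T_FAIL."""
    return {n for n, (_hb, t) in local.items() if now - t > T_FAIL}

local = {"A": (5, 0.0), "B": (3, 0.0)}
merge(local, {"A": (4, 99.0), "B": (7, 99.0), "C": (1, 99.0)}, now=2.0)
print(local)                      # A unchanged; B, C adopted at local time 2.0
print(suspects(local, now=11.0))  # {'A'}: no progress for more than T_FAIL
```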
Which is best failure detector?
---
- Failure Detector Properties:
+ Completeness - Guarantee always
+ Accuracy - Probability Mistake in T time units: PM(T)
+ Speed - T time units (time to first failure detection)
+ Scale - N x L (nodes x load per node)
- All-to-All heartbeating:
+ Each node sends N heartbeats every T time units. Load = N/T
+ Each process stamps its heartbeats with a sequence number, incremented
each period.
- Gossip-Style:
+ tg = gossip period.
+ T = time when all nodes receive gossip.
+ T = (log N) x tg
+ L = Load per node = N/tg = (N x (log N))/T
+ Note that Gossip-Style heartbeat has higher load than All-to-All
heartbeating. That's because Gossip is trying better accuracy by using
more messages.
- What's best/optimal?
+ Find out worst case load (L*) per member as a function of T, PM(T), N
and Independent Message Loss Probability (P_ml)
L* = (1/T) * (log(PM(T))/log(P_ml))
+ Note that worst case load is independent of N.
SWIM Failure Detection Protocol
---
- SWIM takes the opposite approach to heartbeating: it uses pinging.
- Each protocol period = T time units. Failure detection happens here.
Protocol:
- Process P_i sends a direct ping to P_j. If P_j ACKs, no failure. Else,
- P_i sends indirect pings to P_j via K random processes. If any of them
gets a response from P_j, it relays the ACK to P_i and no failure is
declared.
- Otherwise, P_j is marked as failed.
- TODO: Watch videos 2.5 & 2.6 of Week 2 for analysis.
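One protocol period can be sketched as below (`ping` is an assumed transport stub; all names are hypothetical):

```python
def swim_probe(ping, i, j, k_helpers):
    """One SWIM protocol period run by process i probing process j.
    `ping(src, dst)` returns True iff dst answers src directly.
    Returns True if j is judged alive this period."""
    if ping(i, j):                  # direct ping / ACK
        return True
    for k in k_helpers:             # indirect pings via K random peers
        if ping(i, k) and ping(k, j):
            return True             # helper reached j and relays the ACK
    return False                    # no ACK at all: mark j as failed

# Partial partition: i cannot reach j directly, but helper k can
reachable = {("i", "k"), ("k", "j")}
print(swim_probe(lambda a, b: (a, b) in reachable, "i", "j", ["k"]))  # True
```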
Grid Computing
---
- Scheduling jobs on the Grid is one problem.
- Globus Protocol - inter-site job scheduling protocol
- HTCondor Protocol (High Throughput Condor) - intra-site job scheduling and
monitoring protocol.
- Globus Alliance - bunch of univs and research labs.
- Globus Toolkit: Lot of tools (free and open-source) for Grid computing.
- Is Grid Computing and Cloud Computing converging? It's an open question.
Week 3
Topic
* P2P systems (Napster, Gnutella, Fasttrack, Bittorrent, Chord, Pastry,
Kelips)
P2P Systems Intro
---
Napster
---
+ Each user runs a client, called a Peer.
+ When a file is uploaded, it stays local.
+ Meta data about the file is uploaded to Servers (file name, IP, port number,
artists name etc).
+ Servers don't save files (hence no legal liability).
+ When someone searches for a file, Server responds with data.
+ Clients then establish connection and transfer files. Server is not involved.
+ Message exchanges using TCP sockets.
+ Every Server is root of a Ternary tree, with Peers as children.
How do Peers join Servers?
---
+ Send a http request to well-known URL like: www.myp2pservice.com
Problems
---
+ Centralized server is a source of congestion, a single point of failure,
and offers no security.
+ Indirect infringement of copyright law.
Gnutella
---
+ There are no Servers. Clients themselves act as servers to find and
retrieve data; such peers are called "Servents" (server + client).
+ Peers stores files locally and have info about neighbor Peers.
+ This forms an Overlay Graph.
+ Gnutella protocol has 5 main message types:
- Query (Search)
- QueryHit (response to search)
- Ping (to probe network for other peers)
- Pong (reply to ping)
- Push (used to initiate file transfer)
+ Gnutella Protocol Header
+----------------------------------------------------+-------
| Desc ID | Payload Descr | TTL | Hops | Payload len | ...
+----------------------------------------------------+-------
Desc ID: ID of this search
Payload Descr: 0x00 Ping; 0x01 Pong; 0x40 Push; 0x80 Query;
0x81 QueryHit
TTL: Initially 7-10; the message is dropped when TTL reaches 0.
Hops: Incremented at each hop. Used because the initial TTL is not the
same across clients (many clients make little use of Hops).
Payload len: # of bytes of message following this header.
+ Query (0x80)
+------------+------------------------------+
| Min Speed | Search criteria (keywords) |
+------------+------------------------------+
0 1 ...
+ Queries are flooded, TTL-restricted, duplicates are not forwarded.
+ QueryHit looks as follows:
+--------------------------------------------------------------+
|# of hits|port|IP addr|speed|(file idx,fname,fsize)|servent_id|
+--------------------------------------------------------------+
- servent_id is mostly not used.
- Peers maintain reverse route, where they received Query msg from.
- QueryHits are forwarded back on the same route.
Avoiding Excessive Traffic
---
- Query forwarded to all neighbors except the one it was received from.
- Each Query forwarded only once.
- QueryHit routed back only to peer from which Query was received.
- If peer is missing, QueryHit is dropped.
Dealing with Firewalls
---
- Push message handles this: the firewalled responder opens an outgoing
connection to the requester, which pulls the file over it (TODO: revisit).
- Ping & Pong are used to update neighbor info.
Problems
---
- Ping/Pong constituted 50% of traffic (this has been solved).
- Cache and forward that info, reducing ping/pong messages.
- Problem of "freeloaders". 70% of users are freeloaders.
- Can the Query be directed instead of flooding?
FastTrack & BitTorrent
---
- FastTrack is proprietary. Hybrid between Gnutella and Napster.
- Uses "healthier" peers in Gnutella, called Super Nodes.
- Peers become Super Nodes based on reputation and contribution.
- Super Node maintains directory of files and related info. Queries are not
flooded but searched locally.
BitTorrent
---
- Incentivize peers to participate.
- Tracker keeps track of peers. When a peer comes online, it contacts the
Tracker, which sends it details of other peers.
- Peers are either Seed (contains full file) or Leech (partial file).
- Files are divided into blocks (usually of equal size, 32 KB-256 KB).
- Leeches get blocks from Seed and other Leeches.
Which blocks are transferred to a Leech?
---
- Download Local Rarest First block policy
+ prefer early download of blocks that are least replicated among neighbors.
- Tit-for-tat bandwidth usage: Provide blocks to neighbors that provided it
the best download rates.
+ Incentive for nodes to provide good download rates.
- Choking: Limit the number of neighbors to which the peer concurrently
uploads (fewer than 5). These are its *best* neighbors.
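A minimal sketch of the rarest-first block choice (hypothetical data; ties broken arbitrarily):

```python
from collections import Counter

def rarest_first(my_blocks, neighbor_blocks):
    """Local Rarest First: among blocks this leech is still missing,
    pick the one replicated on the fewest neighbors."""
    counts = Counter()
    for blocks in neighbor_blocks:
        counts.update(blocks)
    missing = counts.keys() - my_blocks
    return min(missing, key=lambda b: counts[b]) if missing else None

# Hypothetical swarm: block 3 is held by only one neighbor
neighbors = [{0, 1, 2}, {0, 1}, {1, 2, 3}]
print(rarest_first(my_blocks={1}, neighbor_blocks=neighbors))  # 3
```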
Chord (came from Academia)
---
- Distributed Hash Table.
- Performance Goals of DHT
+ load balancing
+ fault-tolerant
+ efficiency of lookups and inserts
+ locality
- Napster, Gnutella, FastTrack are all loose forms of DHTs, but Chord is a
true DHT.
- Chord is O(log N) memory, lookup latency and # of messages for a lookup.
- Chord uses Consistent Hashing on nodes peer address
+ SHA-1 (ip address, port) --> 160 bit string.
+ Truncated to m bits
+ Called peer id (number between 0 and 2^m -1)
+ Not unique, but id conflicts very unlikely.
+ Can then map peers to one of 2^m logical points on a circle.
- Nodes maintain two types of Peer Pointers
- Peer Pointers (1):
+ Peers are placed on a ring, with an id between 0 and 2^m-1
+ Peers talk to their neighbors, called successors.
- Peer Pointers (2): Finger Tables
+ TODO: Watch video 5 around 11th minute to understand.
+ ith entry in finger table for peer n is first peer with:
id >= (n + 2^i) (mod 2^m)
Example: Suppose there are following nodes in the cluster
* 16, 32, 45, 80, 96 and 112
FT for Node 80 looks as follows (m=7 here):
i FT[i]
0 96
1 96
2 96
3 96
4 96
5 112
6 16
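The finger-table rule above can be checked in a few lines (a sketch; it reproduces the Node 80 table for the example ring):

```python
from bisect import bisect_left

def finger_table(node, nodes, m=7):
    """Chord finger table: entry i is the first peer with
    id >= (node + 2**i) mod 2**m, wrapping around the ring."""
    ring = sorted(nodes)
    table = []
    for i in range(m):
        target = (node + 2**i) % (2**m)
        idx = bisect_left(ring, target)
        table.append(ring[idx % len(ring)])  # wrap past the largest id
    return table

nodes = [16, 32, 45, 80, 96, 112]
print(finger_table(80, nodes))  # [96, 96, 96, 96, 96, 112, 16]
```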
- Where are files stored?
+ SHA-1 (filename) --> 160 bit string (key)
+ File is stored at first peer with id >= its key (mod 2^m)
Example: In above cluster, if node 80 gets a file with id = 30 to be inserted,
what does it do?
1) File 30 must be inserted at Node 32
2) Node 80 looks up its finger table. There's no pointer to 32. It hands off
to its successor to look up its FT. This continues till Node 32 is reached.
3) Node 80 hands off to Node 96, whose finger table looks as follows
i FT[i]
0 112
1 112
2 112
3 112
4 112
5 16
6 16
4) There's no pointer to Node 32 in above table. So, Node 96 will hand it off
to Node 112, it's successor, whose finger table looks as follows
i FT[i]
0 16
1 16
2 16
3 16
4 16
5 16
6 80
5) Again, Node 112's FT doesn't have pointer to 32. This sucks. It hands off
to its successor, Node 16,
i FT[i]
0 32
1 32
2 32
3 32
4 32
5 80
6 80
6) Node 16 is able to pass the file to Node 32 which will store it.
- How do file searches work?
+ Hash the unique file name (or URL) using consistent hashing algo.
+ It gives key K.
+ At node N, send query for key K to largest successor/finger entry <= K
If none exists, send query to successor(N). Largest successor means, on
the virtual ring, it's the right most, but less than K.
Analysis: Finding key K (Search) takes O(log N), where N is number of nodes in
cluster
Pastry
---
- Assigns IDs to nodes, similar to Chord.
- Each node knows its successors and predecessors.
- Routing tables based on prefix matching, thus log(N)
- Among potential neighbors, one with shortest RTT is chosen.
- Shorter prefix neighbors are closer than longer prefix neighbors.
- But overall, the RTT to the longest-prefix neighbors is still comparable
to the underlying Internet's RTT.
Week 4
Topics
* Key-Value stores, NOSQL
* CAP theorem
* Cassandra, HBase
* Time ordering, NTP, Cristian's Algo, Vector Clocks, Lamport Timestamps
Key-Value Store Abstraction
---
- Similar to dictionary data struct, but distributed. Distributed hash as in
P2P systems.
- In RDBMS (ex. MySQL), data is structured and stored in tables with few
main items:
Primary Key, Secondary Key, Foreign Key and a few columns.
- Data in today's world is different than problems solved by RDBMS
+ Data is large and unstructured
+ Lots of random read/writes (sometimes write heavy)
+ Foreign keys rarely needed
+ Joins are infrequent
- Requirements of Today's workload
+ Speed
+ Avoid single point of failure (SPoF)
+ Low TCO (Total Cost of Operation)
+ Scale out and not up
- Scale Out, Not Scale Up
Scale Up = grow cluster capacity by replacing with more powerful m/c
Scale Out = grow by adding COTS m/c (Commercial Off-The-Shelf)
Key-Value (NoSQL) Data Model
---
- NoSQL = "Not Only SQL"
- Main operations = get(key), put(key, value)
+ and few more extended operations
- Tables have following properties
+ "Column Families" in Cassandra, "Table" in HBase, "Collection" in
MongoDB
+ Unstructured.
+ Don't always support JOIN or have Foreign Keys
+ Can have index tables like RDBMS
Column-Oriented Storage
---
- RDBMS store an entire row together
- NoSQL systems typically store a column together (or group of columns)
+ given a key, entries within a column are indexed and easy to locate,
and vice-versa.
- Searches involving columns are fast.
+ Ex: Get all blog id's that were updated in last month.
- Prevents fetching entire table (or row) on a search.
Apache Cassandra
---
History
-------
- Facebook is original designer. Now managed by Apache and many companies
use it.
- Joined Apache Incubator in 2009
- Apache top level project in 2010
- 0.8 version in 2011 had CQL language
- 2.0 version in 2013 had Light Weight Transactions (LWT)
- 3.0 in 2015 had Materialized Views and SS Table Attached Secondary Indexes
(SASI)
* Materialized Views are similar to relational tables of SQL databases,
but with a few constraints (TODO)
Cassandra Data Models
-----
- Tables
* Narrow Tables, but very tall
* There are no JOINs. Hence, tables are De-Normalized. That's because JOINs
are expensive and reads become slow if a JOIN exists
* There are no Foreign Key constraints
* Tables always exists within a keyspace
Example:
cqlsh> CREATE KEYSPACE essentials WITH REPLICATION = {'class': 'SimpleStrategy',
'replication_factor': 1};
cqlsh> USE essentials;
cqlsh:essentials> CREATE TABLE movies (
... movie_id UUID,
... title TEXT,
... release_year INT,
... PRIMARY KEY ((movie_id)) <<< inner parens delimit the partition
... key, e.g. ((a, b), c) for a composite partition key
... );
cqlsh> insert into movies (movie_id, release_year, title) values (uuid(), 2011,
'tree of life');
cqlsh> select * from movies;
- Keys
* All access to tables requires key. It is central concept
* Primary Key is composed of Partition Key and Cluster Key(s). There can be
multiple columns in Primary Key
cqlsh> CREATE TABLE movies_by_actor (
... actor TEXT,
... release_year INT,
... movie_id UUID,
... title TEXT,
... genres SET<TEXT>,
... rating FLOAT,
... PRIMARY KEY ((actor), release_year, movie_id)
... ) WITH CLUSTERING ORDER BY (release_year DESC, movie_id ASC);
- In above primary key, 'actor' is Partition Key. All rows of a partition are in
same server/node. All 'Brad Pitt' movies are on same node.
- Clustering Keys are (release_year, movie_id). They order rows within a
Partition Key. Then queries within a partition are performant as they are
within same node.
Ex: Return all Brad Pitt movies between 2011 and 2015.
- Note that a column MUST be in Primary Key to be used in query
- Partition Key is not orderable while Clustering Key can be ordered
# Valid query
cqlsh> select * from movies_by_actor where actor='Brad Pitt' and release_year < 2015
# Invalid query because rating is not in PK
cqlsh> select * from movies_by_actor where actor='Brad Pitt' and release_year <
2015 and rating > 8
Data Types
----
- text, int, bigint, float, double, boolean
- decimal, varint
- uuid, timeuuid
- inet
- tuple
- timestamp
- counter
Collections
----
- Set
- List << ordered elements
- Map << key:value pairs
Secondary Indexes
---
Nodes & Clusters
---
-
Key->Server Mapping (called Partitioner)
---
- Uses Virtual Ring DHT described in Chord protocol/lecture but without finger
tables or routing tables.
DHT = Distributed Hash Table
- Coordinator: One coordinator per DC
- Partitioner: Maintains Key to Server mapping
Data Placement Strategies
---
Two Options:
+ SimpleStrategy
+ NetworkTopologyStrategy
SimpleStrategy:
+ Uses Partitioner, of which there are two kinds
* RandomPartitioner - Chord like hash partitioning.
* ByteOrderedPartitioner - Assigns ranges of keys to servers. This helps
in range queries
NetworkTopologyStrategy: for multi-DC deployments
+ 2 replicas/DC, 3 replicas/DC
+ Per DC:
* First replica placed according to Partitioner
* Then go clockwise around ring until you hit a different rack.
Snitches:
- Maps IPs to Nodes in Racks and DCs. Configured in cassandra.yaml
- SimpleSnitch: Unaware of Topology (rack-unaware)
- RackInferring: Octets of the IP indicate DC and Rack.
+ x.<DC>.<rack>.<node> - This is a best effort.
- PropertyFileSnitch: Uses a config file to assign IP
- EC2Snitch:
- Variety of snitch options available.
Writes - How are they implemented?
---
- Need to be lock-free and fast (no disk seeks)
- Clients sends write to a coordinator in Cassandra cluster.
- Coordinator uses Partitioner to send query to all replica nodes responsible
for key.
- When X replicas respond, coordinator returns an ACK to client.
+ What is X?
- Replicas must be ALWAYS writable. How to achieve?
+ Hinted Handoff Mechanism
* If a replica is down, the coordinator writes to the other replicas and
keeps the missed write locally (a "hint"), replaying it when the replica
comes back up within a time limit.
* If ALL replicas are down, the coordinator buffers the write locally (for
up to a few hours) and replays it when replicas return.
+ One ring per DC.
* Per DC coordinator elected to talk to other DCs.
* Election of coordinator done via Zookeeper.
When writes come, what does Replica do?
---
- log the request.
- Make changes to Memtable (in-memory representation of key-value pairs)
- memtable can be searched. It's a write-back cache as opposed to write-through
cache.
- When memtable is full, flush to disk. Disk will have following:
* Data File: An SSTable (Sorted String Table) - list of KV pairs, sorted
by key
* Index File: Another SSTable of Key, position in Data SSTable
+ Search for a missing Key is expensive. A binary search will incur
log(M)/SSTable if there are M entries in SSTable. COSTLY!!
* Bloom Filter, for efficient search
- Add a Bloom Filter to tell if a key is present in SSTable. Very efficient.
+ An item NOT present can return as present.
+ An item present ALWAYS returns as present.
// Video on Cassandra, 16th min
Bloom Filter
---
- Compact way of representing a set of items
- Checking for existence in set is cheap
- Some probability of false positives
+ An item NOT in any set MAY return as present
- Never false negatives
+ An item present ALWAYS returns as present
How does it work?
---
- It's a large bitmap
- Filter uses a large bitmap and n Hash functions: H1, H2, ... Hn
- All bits are initialized to 0. Output of Hash function is one of the bits in
the bitmap
Insert Key, K:
for each hash function, Hi:
bitmap[Hi(k)] = 1 // just set the bit
Is Key, K, present?
for each hash function, Hi:
if bitmap[Hi(K)] == 0
return false
return true
- For large Bloom Filters, probability of false positives is very very low
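The pseudocode above, made runnable (using salted SHA-1 as the hash family is an illustrative choice, not what Cassandra uses internally):

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k=3):
        self.m, self.k = m_bits, k
        self.bits = 0                       # bitmap packed into one int

    def _positions(self, key):
        # k hash functions derived by salting one hash with the index i
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos           # bitmap[Hi(key)] = 1

    def might_contain(self, key):
        # False => definitely absent; True => probably present
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add("row-42")
print(bf.might_contain("row-42"))  # True: never a false negative
```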
Compaction
---
- Over time, key may be in multiple SSTable and compaction is process of merging
them. Runs periodically and locally at each server.
Deletes
---
- Don't delete an item right away; put a marker called a "tombstone". When
Compaction runs, it deletes the item.
Reads
---
- Coordinator can contact X replicas. From responses, Coordinator returns latest
time-stamped response.
- Coordinator also fetches the value from other replicas (in background) to
check consistency and initiate *read repair*.
- Due to this, Reads are slower than Writes.
Membership
---
- Every server in cluster has list of all others in the cluster.
Suspicion Mechanisms
---
- Mechanism to confirm that a suspected failure of a server is real (not a
false positive).
- Accrual Detector: it outputs a value called Phi, which represents suspicion.
- Phi calculation for a member:
+ takes into account inter-arrival times of gossip messages from a server.
+ Phi(t) = -log10(P_later(t_now - t_last))
* P_later(t) = 1 - CDF(t): probability that the next heartbeat arrives
more than t after the last one.
+ Phi adapts the detection timeout to the historical inter-arrival times.
- In practice, Phi=5 means about 10-15s detection time.
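A sketch of the phi value under an exponential inter-arrival assumption (a simplification; real implementations fit a distribution to the observed arrival history):

```python
import math

def phi(t_since_last, mean_interarrival):
    """Accrual suspicion value, assuming exponentially distributed
    inter-arrival times: P_later(t) = exp(-t/mean), Phi = -log10(P_later)."""
    p_later = math.exp(-t_since_last / mean_interarrival)
    return -math.log10(p_later)

# Heartbeats normally ~1 s apart; 5 s of silence -> suspicion grows
print(round(phi(5.0, 1.0), 2))  # 2.17
```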
Cassandra Vs RDBMS (MySQL): On 50 GB of data
---
- MySQL:
+ writes 300ms avg; reads 350ms avg
- Cassandra:
+ writes .12ms avg and reads 15ms avg
CAP Theorem
---
- Consistency, Availability, Partition-Tolerance
- Only 2 out of 3 can be satisfied.
- Cassandra: Eventual (Weak) Consistency, Availability & Partition-Tolerance.
- RDBMs: Strong Consistency over Availability.
- RDBMS provide ACID
+ Atomicity
+ Consistency
+ Isolation
+ Durability
- Key-Value Stores like Cassandra provide BASE
+ Basically Available Soft-state Eventual Consistency
- What is X in Cassandra?
+ Represents consistency levels
- X = ANY
+ Any server (not necessarily a replica) may acknowledge. Fastest, weakest
consistency.
- X = ALL
+ All replicas must respond. Slowest.
- X = ONE
+ At least one replica must respond. Faster than ALL. Can't tolerate
failure.
- X = QUORUM
+ Look at Video (1.3) around 13th minute.
+ Faster than ALL. Gives greater consistency than ONE.
+ Many NoSQL uses Quorum (Cassandra, RIAK)
Reads with Quorum
---
- Client specifies R (# of replicas).
- Read must be received from R replicas.
Write with Quorum
---
- Client specifies W and writes to W replicas and returns.
- Coordinator (a) blocks until quorum is reached OR (b) Asynchronous; two
flavors exist.
- For strong consistency two conditions required:
1) W+R > N
2) W > N/2
R = read replica count; W = write replica count
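The two conditions can be checked mechanically (a sketch; `n` here is the replication factor):

```python
def strongly_consistent(n, w, r):
    """Quorum intersection check: W+R > N makes every read quorum overlap
    every write quorum; W > N/2 makes any two write quorums overlap."""
    return (w + r > n) and (2 * w > n)

print(strongly_consistent(3, 2, 2))  # True:  R(2) overlaps W(2) out of 3
print(strongly_consistent(3, 1, 1))  # False: a read can miss the write
```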
QUORUM Variations
---
- QUORUM: across all DCs
- LOCAL_QUORUM: in coordinators DC
- EACH_QUORUM: in every DC
Consistency Spectrum
---
- From Weak to Strong
- Eventual, Causal, per-key Sequential, Red-Blue, Probabilistic, CRDTs,
Sequential.
- Per-Key Sequential: Per-Key basis, all operations have a global order.
- CRDTs (Commutative Replicated Data Types):
+ Operations for which commutated writes give same result.
+ Ordering is unimportant in that case.
- Red-Blue: Divide operations into Red and Blue. Order is important for Red
operations but not for Blue ones.
- Causal: Reads and Write are causally linked. One happens before other or
depends on other.
What are models of Strong Consistency?
---
- Linearizability: Each op is visible instantaneously to all other clients.
- Sequential Consistency [Leslie Lamport]:
- Some NoSQL are supporting ACID.
HBase
---
- Yahoo! open-sourced an implementation of Google's BigTable design, called
HBase.
- API Functions:
+ get/put(row)
+ scan(row range, filter) - range queries
+ multiput
- HBase prefers consistency over availability.
HBase Architecture
---
- Lookup online.
- HDFS is underlying storage.
HBase Storage Hierarchy
---
- HBase Table
+ Split in to multiple regions and replicated.
+ ColumnFamily = subset of columns with similar query patterns.
+ One "Store" per combination of ColumnFamily and Region.
* "Memstore" for each Store: in-memory updates to Store. Flushed to disk
when full.
* StoreFiles for each Store (HFile is underlying format)
- HFile
+ SSTable from Google's BigTable
How does HBase maintain Strong Consistency?
---
- Using Write-Ahead Log, called HLog.
- HLog is written so if there's a failure in Memstore, HRegionServer/HMaster can
replay the message from HLog and write to Memstore.
Cross-Datacenter Replication
---
- Single "Master" cluster.
- Other "Slave" clusters replicate same tables.
- Master cluster synchronously sends HLogs over to Slave clusters.
- Coordination among clusters via Zookeeper.
- Zookeeper can be used as a file system to store control info.
Time and Ordering
---
Asynchronous Distributed Systems:
---
- No global clock: each process has its own clock, and message delays are
unbounded.
Clock Skew vs Clock Drift
---
- CS: Relative diff in clock values of two processes
+ Like distance between two vehicles on a road.
- CD: Relative diff in clock frequencies (rates) of two processes.
+ Like difference in speeds of two vehicles on the road.
- non-zero CS means, clocks are not synchronized.
- non-zero CD causes CS to increase.
How often to synchronize?
---
- MDR: Maximum Drift Rate of a clock
- Coordinated Universal Time (UTC): is "correct" time at any point.
- Absolute MDR is defined relative to UTC.
- Max drift rate between two clocks with similar MDR = 2*MDR
- Given a max acceptable skew M, between any pair of clocks, need to synchronize
at least once every: M/(2*MDR)
+ Since time = distance/speed
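Worked example with hypothetical numbers (the MDR and M values below are chosen for illustration):

```python
# Clocks drift at most 10 us per second (MDR = 1e-5) relative to UTC, and
# we tolerate a maximum pairwise skew of M = 1 ms. Relative drift between
# two such clocks is up to 2*MDR.
MDR = 1e-5
M = 1e-3
sync_period = M / (2 * MDR)  # seconds between synchronizations
print(sync_period)           # ~50 -> resync at least every 50 seconds
```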
External vs Internal Synchronization
---
External Sync
---
- Each process clock C is within a bound D of a well-known external clock S.
|C-S| < D at all times.
- Ex: Cristian's algorithm and NTP
Internal Sync
---
- Every pair of processes in a group have clocks within bound D
|C(i) - C(j)| < D at all times for processes i and j.
- Ex: Berkeley algo.
- External Sync with bound D implies Internal Sync with bound 2*D
- Internal Sync does not imply External Sync
+ In fact, the entire system may drift away from external clock S
Cristian's Algorithm
---
- External time sync. All processes P sync with external time server S.
- P measures RTT of message exchanges.
- min1 = minimum P->S latency
- min2 = minimum S->P latency
- Actual time at P when it receives response is between:
[t+min2, t+RTT-min1]
Note that RTT > (min1+min2)
- P sets its time to halfway through this interval
t+(RTT+min2-min1)/2
- Error is at most (RTT-min1-min2)/2
+ Bounded error.
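Worked example with hypothetical timings:

```python
def cristian(t_server, rtt, min1, min2):
    """Interval containing the true time at P when the reply arrives,
    the midpoint P actually sets, and the worst-case error."""
    lo = t_server + min2           # reply needed at least min2 to arrive
    hi = t_server + rtt - min1     # request needed at least min1 to reach S
    return lo, hi, (lo + hi) / 2, (rtt - min1 - min2) / 2

# Server replied t=100.0 s; RTT=0.020 s; min1=min2=0.004 s (assumed values)
lo, hi, setting, err = cristian(100.0, 0.020, 0.004, 0.004)
print(setting, err)  # midpoint ~100.01 s, error bound ~0.006 s
```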
Gotchas
---
- Allowed to increase clock value but not decrease, to maintain linearity of
events. Otherwise it may violate ordering of events within a process.
- Allowed to increase or decrease clock speed.
- If error is too high, take multiple readings and average them.
NTP
---
- NTP servers organized in a tree.
- Root of tree = Primary Servers, where UTC time/clock present.
- Children of Root = Secondary Servers.
- Grandchildren of Root = Tertiary Servers.
- Each client is leaf of the tree.
TR1 TS2
Child ------------^---------------------^-------------------------->