您所在位置：网站首页 > 高等教育 > 研究生课件 > 数据库管理系统概述英文版课件：tutorial9 Join Algorithms

数据库管理系统概述英文版课件：tutorial9 Join Algorithms.ppt

18页

卖家[上传人]：枫**

文档编号：580587348

上传时间：2024-08-29

文档格式：PPT

文档大小：736KB

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

15金贝

下载

/ 18 举报版权申诉马上下载

文本预览

下载提示

常见问题

The Hong Kong University of Scienceand TechnologyCOMP231 Tutorial 9Join AlgorithmsDepartment of Computer Science, HKUST2Database Management SystemsReview: Block Nested-Loop JoinqRead in the outer relation r page by pageqFor each page of r, scan the entire inner relation s.qBest cost: br + bs (if s is small enough to fit in memory)Øbr (bs): no. of pages of r (s)qBuffer needed at least: 3 pages (1 for r, 1 for s, 1 for output)ØIf memory has M pages, read M – 2 pages of r at a time and use the remaining two pages for s and the output.ØCost = br / (M – 2)  bs + br00050005000200020004000400020002000200020001000100050005000500050002000200020002000300030002000200050005r rs sDepartment of Computer Science, HKUST3Database Management SystemsReview: Indexed Nested-Loop JoinqIndex lookup can replace file scan if an index is available on the join attribute of inner relationqFor each tuple tr in the outer relation r, use the index to look up tuples in s that satisfy the join condition with tuple tr.00050005000200020004000400020002000200020001000100050005r r000500050002000200020002000300030002000200050005s sindex on sindex on sqCost of the join: br + nr  cØnr: no. of tuples in rØc is the cost to traverse index and fetch all matching s tuples for one tr.Øc can be estimated as cost of a single selection on s using the join condition.If indices are available on the join attribute of both r and s,use the relation with fewer tuples as the outer relation.Department of Computer Science, HKUST4Database Management SystemsReview: External Sort-Merge•Algorithm:–Create sorted runs–N-way merge:•Read the first page of each run into buffer page•REPEAT–Select the first record (in sorted order) among all buffer pages–Write the record to the output buffer. If the output buffer is full write it to disk.–Delete the record from its input buffer page.IF the buffer page becomes empty THENread the next page (if any) of the run into the buffer. •UNTIL all input buffer pages are empty1210111220791792800…sorted run 1sorted run 2sorted run 80pass0pass1run 1run 2run 910*9 pagessorted run 110*8 pagessorted run 9pass2…sorted file800 pagesEg: 800 pages of a file10 pages of main memoryDepartment of Computer Science, HKUST5Database Management SystemsReview: Sort-Merge JoinqSort both relations on the join attribute (if not already sorted)qMerge the sorted relations to join themØJoin step is similar to the merge stage of the MergeSort algorithm ØMain difference is the handling of duplicate values in join attribute — every pair with the same value on join attribute must be matchedqCan be used only for equi-joins and natural joinsqEach block needs to be read only once qCost = br + bs + the cost of sorting if relations are unsorted00010001000200020002000200020002000400040005000500050005000200020002000200020002000300030005000500050005Department of Computer Science, HKUST6Database Management SystemsReview: Hash Join0001000100020002000200020002000200040004000500050005000500020002000200020002000200030003000500050005000500010001000200020002000200020002000400040005000500050005000200020002000200020002000300030005000500050005q Applicable for equi-joins and natural joinsq A hash function h is used to partition tuples of both relations Ø h maps JoinAttrs values to {0, 1, ..., n}, where JoinAttrs denotes the common attributes of r and s used in the natural join. Ør0, r1, . . ., rn denote partitions of r tuples; each tuple tr  r is put in partition ri,where i = h(tr [JoinAttrs]).Ø s0,, s1. . ., sn denotes partitions of s tuples; each tuple ts s is put in partition si, where i = h(ts [JoinAttrs]).Department of Computer Science, HKUST7Database Management SystemsJoin Exercise •Assume the query: Find the names of sailors who have reservations•# Records of Sailors nr = 10,000; •# Records of Reservations ns = 40, 000. •# of memory pages M = 100Sailors(Sid, Name, Rating, Age)Reserves(Sid, Bid, Date)SELECT NameFROM Sailors, ReservesWHERE Sailors.Sid=Reserves.SidDepartment of Computer Science, HKUST8Database Management SystemsJoin Exercise •Assume the query: Find the names of sailors who have reservations•Each attribute is 20 bytes. Each page is 1000 bytes. There are 10,000 sailors, 40,000 reservations and a buffer of M=100 pages.nQuestion 0: How many pages does Sailors contain? How many pages does Reservations contain?–Each sailor record is 80 bytes and each reserves record 60 bytes. –There are 1000/80=12 sailors per page and 1000/60=16 reservations per page. –Sailors contains 10000/12=834 pages and reserves 40000/16=2500 pages. •Since there are 10K sailors and 40K reservations, a sailor has on the average 4 reservations (but it is possible that some sailors have more, while others have none). Sailors(Sid, Name, Rating, Age)Reserves(Sid, Bid, Date)SELECT NameFROM Sailors, ReservesWHERE Sailors.Sid=Reserves.Sid# Records of Sailors ns = 10,000; # Records of Reservations nr = 40, 000. # of pages of Sailors bs, = 834; # of pages of Reservations br = 2500# of memory pages M = 100Department of Computer Science, HKUST9Database Management SystemsBlock Nested Loop JoinnQuestion 1: Using Block Nested Loop, if use Sailors as outer relation, what is the cost?•Mechanism:–Buffer M-2 pages for Outer Relation, one for Inner relation, the other for output. •For each time, read 98 pages of S. –Scan R page-by-page to find match•Total 834 pages, thus Need 834/98 = 9 blocks/times•The total cost is: 834+9*2500= 23,334 •bout + bout / (M – 2) * binner# Records of Sailors ns = 10,000; # Records of Reservations nr = 40, 000. # of pages of Sailors bs, = 834; # of pages of Reservations br = 2500# of memory pages M = 100Department of Computer Science, HKUST10Database Management SystemsBlock Nested Loop JoinnQuestion 2: Using Block Nested Loop, if use Reservations as outer relation, what is the cost?•Mechanism:–Buffer M-2 pages for Outer Relation, one for Inner relatoin, the other for output. •bout + bout / (M – 2) * binner•For each time, read 98 pages of R. –Scan S page-by-page to find match•Total 2500 pages, thus Need  2500/98  = 26 times•The total cost is: 2500+26*834= 24,184 # Records of Sailors ns = 10,000; # Records of Reservations nr = 40, 000. # of pages of Sailors bs, = 834; # of pages of Reservations br = 2500# of memory pages M = 100Department of Computer Science, HKUST11Database Management SystemsIndex Nested LoopnQuestion 3: Assume use Hash Index on Reserve.SID. Assume for each sailor the cost for searching on the hash index take 1.2 page accesses (because of overflow buckets). •For each Sailor, 1.2 + 4 = 5.2 pages need to spend to retrieve matching Reserve record•Total Cost: cost of reading Sailors + # records in sailors*5.2= 834+ 10,000*5.2=52,83456 Frank57 Sophie58 James59 Tom145858 $213 $158 $814 $258 $332 $758 $5．．．．．．．．．．．．．．．．．．H(58)234Find entry in the index cost 1.2 pagesAverage 4 entries need to retrieveSaliorsIndex using hashingon ReserveH(SID)=SID % 11Reserve# Records of Sailors ns = 10,000; # Records of Reservations nr = 40, 000. # of pages of Sailors bs, = 834; # of pages of Reservations br = 2500# of memory pages M = 100Department of Computer Science, HKUST12Database Management SystemsIndex Nested LoopnQuestion 3: Assume use Hash Index on Reserve.SID. Assume for each sailor the cost for searching on the hash index take 1.2 page accesses (because of overflow buckets). •How to improve? •Remember the query: •Thus do not need to retrieve records from Reserve•Now for each Sailor, only 1.2 pages needed to get the names would be matched•Total Cost: cost of reading Sailors + # records in sailors*1.2= 834+ 10,000*1.2=12,83459 Tom58 James57 Sophie56 Frank145832 $758 $558 $314 $258 $813 $158 $2．．．．．．．．．．．．．．．．．．H(58)234SaliorsIndex using hashingon ReserveH(SID)=SID % 11ReserveSELECT NameFROM Sailors, ReservesWHERE Sailors.Sid=Reserves.SidDepartment of Computer Science, HKUST13Database Management SystemsSort Merge Join •Question 4: Using Sort Merge Join method, sort Sailors and Reserve first and then merge. Estimate the total cost. •Cost of Sorting Sailors on SidSorting Sailors: 834+417 (pass 0) + 417 + 417 (pass 1) = 2085Sailors(Sid, Name, Rating, Age)Reserves(Sid, Bid, Date)128341250sorted run 1sorted run 2sorted run 9run 1run 2run 9......PASS 0PASS 151100401417At each sorted run we read on 100 sailor pages but we only write 50 because we discard attributes Rating, Age(they are not needed for the join and are not required in the result)Department of Computer Science, HKUST14Database Management SystemsSort Merge Join (cont)•Sort Reserves on Sid Cost: 2500+850 (pass 0) + 850=4200. As each sorted page of Reserves is generated we can directly find the joining tuples of sailors (in order to avoid writing the result of pass 1 in a temporary file and reading it again for the merge phase). Thus, for joining we only need to scan the sorted sailors file.•Total cost: 2085+4200+417= 6702 Sailors(Sid, Name, Rating, Age)Reserves(Sid, Bid, Date)Department of Computer Science, HKUST15Database Management SystemsMore example about Merge Join•R1 has 1000 pages, R2 has 500 pagesi R1{i}.CR2{j}.C j110 51220 202320 203430 304540 305 506 527MemoryR1R2…..…..R1R2Given R1 and R2 are already sorted on CTotal cost: Read R1 cost + read R2 cost= 1000 + 500 = 1,500 pagesDepartment of Computer Science, HKUST16Database Management SystemsHash Join on Sailor and Reservation•Use the small relation (Sailors = 834) as the build input. •How to choose the number of buckets n?–n should be such that each partition for Sailors fits in memory, e.g., n= 10 buckets (so that average bucket size is 83.4 pages < M). 100 main memory buffersDiskDiskOriginal SOUTPUT1INPUT0hashfunctionh 9Partitions019. . .100 main memory buffersDiskDiskOriginal ROUTPUT1INPUT0hashfunctionh 9Partitions019. . .Cost of Partition S:834 + 834READWriteCost of Partition S:2500 + 2500READWriteDepartment of Computer Science, HKUST17Database Management SystemsHash Join(cont)•Finally matching process:–Use the small relation (Sailors = 834) as the build input, R as probe input. –Match bucket of Sailors with corresponding bucket of Reserves. –Matching cost: 834 + 2500•Total Cost: 3(2500+834)=10,002 Partitionsof R & SInput bufferfor RiHash table for bucket partition Si100 main memory buffersDiskOutput bufferDiskJoin Resulthashfnh2h2Department of Computer Science, HKUST18Database Management SystemsHybrid Hash–Join•What’s Hybrid Hash Join?–Main feature of hybrid hash join: Keep the first partition(s) of the build relation in memory. •Sailors as the build input. Assign 90 memory pages to the first bucket in partition of Sailors. Thus first bucket do not need to write back. –Size of first bucket =84 < 90, keep it all in memory. –Cost = 834+(834-84) (we do not write the first bucket to memory)•In participation of Reserves:–record of R belong to bucket 1 match directly with first bucket of Sailors. –Others remain the same method. –Cost = 2500+(2500-250). •Finally we read buckets 2-10 of Sailors and match them to the corresponding buckets of Reserves. Cost= (834-84)+(2500-250). •Total cost= 3(2500+834)-2(84+250)=10,002-668=9,334。

点击阅读更多内容