%! Author = mrmcx
%! Date = 14.07.2022
%! Template = Fabian Wenzelmann, 2016--2019

\documentclass[a4paper,
    twoside, % two-sided layout
    headlines=2.1 % number of lines in the heading; increase if you need more
]{scrartcl}
\usepackage[
    margin=2cm,
    includefoot,
    footskip=35pt,
    includeheadfoot,
    headsep=0.5cm,
]{geometry}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage[T1]{fontenc}
\usepackage{mathtools}
\usepackage{amssymb}
\usepackage{lmodern}
\usepackage[automark,headsepline]{scrlayer-scrpage}
\usepackage{enumerate}
\usepackage[protrusion=true,expansion=true,kerning]{microtype}
\usepackage{hyperref}

\newcommand{\yourname}{Simon Moser}
\newcommand{\lecture}{Advanced Database and Information Systems}
\newcommand{\project}{Project 2}
\author{\yourname}
\title{\lecture}
\subtitle{\project}

\pagestyle{scrheadings}
\setkomafont{pagehead}{\normalfont}
\lohead{\lecture\\\yourname}
\lehead{\lecture\\\yourname}
\rohead{\project}
\rehead{\project}

\begin{document}
    \maketitle

    \section{Problem Statement}
    \label{sec:problem-statement}
    This report presents different approaches to join\cite{wikijoin} algorithms for tables.
    When two tables are joined, their columns are combined based on one related column in each table.
    Every possible combination of rows in which those columns are equal is returned.
    A very simple approach is to loop over the first table and, for each of its rows, loop over the second table and check the related columns for equality.
    Since the complexity of this approach is $O(M \cdot N)$ for table sizes $M$ and $N$, and therefore very high, the following sections present different approaches to this problem.

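The naive nested-loop approach described above can be sketched in Python as follows. This is an illustrative sketch only; the table representation (lists of dicts) and the sample data are assumptions, not the project's actual code.

```python
def nested_loop_join(left, right, key):
    """Naive O(M*N) join: compare every row of `left` with every row of `right`."""
    result = []
    for l in left:
        for r in right:
            if l[key] == r[key]:
                # merge the two matching rows into one result row
                result.append({**l, **r})
    return result

# Hypothetical sample tables for illustration
users = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
likes = [{"id": 1, "product": "Book"}, {"id": 1, "product": "Pen"}]
print(nested_loop_join(users, likes, "id"))  # prints Alice joined with both of her liked products
```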
    \section{Algorithm Description}
    \label{sec:algorithm-description}

    \subsection{Hash Join}
    \label{subsec:hash-join}
    The Hash Join\cite{wikihash} algorithm offers a faster solution by using a hash table.
    A rough description of the algorithm:
    \begin{enumerate}
        \item Add each row of the smaller table to the hash table, using the compared column to compute the hash.
        \item For each row of the other table, look up the hash of its compared column.
        \item If the hash exists in the hash table, compare the actual column values to ensure that not only the hashes are equal (guarding against hash collisions).
    \end{enumerate}
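The steps above can be sketched in Python. This is a simplified illustration, not the project's implementation; note that a Python dict performs the collision check from step 3 implicitly, because key lookups compare by equality, not just by hash.

```python
from collections import defaultdict

def hash_join(left, right, key):
    """Build a hash table over the smaller table, then probe it with the other one."""
    build, probe = (left, right) if len(left) <= len(right) else (right, left)
    table = defaultdict(list)
    for row in build:
        table[row[key]].append(row)  # step 1: hash the compared column
    result = []
    for row in probe:
        for match in table.get(row[key], []):  # steps 2-3: expected O(1) lookup
            result.append({**match, **row})
    return result
```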
    One caveat of this approach is that its speed relies on the hash table being held in memory.
    If the hash table exceeds the available memory, the join is slowed down considerably.

    This algorithm is expected to have a complexity of $O(M+N)$, because the first table has to be scanned once to build the hash table and the second table has to be scanned once to probe it.
    Here, a single hash table access is expected to cost $O(1)$.

    \subsection{Sort Merge Join}
    \label{subsec:sort-merge-join}
    A Sort Merge Join\cite{wikimergesort} limits the complexity by relying on efficient sorting algorithms.
    For this approach, both tables are first sorted by the compared column.
    Then a pointer is set to the start of each table.
    In each step, the compared columns at the two pointers are compared.
    If they are equal, the combined rows are added to the result.
    If not, the pointer in the table with the smaller value is advanced by one row.

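The two-pointer merge phase can be sketched in Python as follows. This is an illustrative sketch (the list-of-dicts table format is an assumption); the inner loops handle the case where several rows share the same key value, so that every matching combination is emitted.

```python
def sort_merge_join(left, right, key):
    """Sort both tables on the join column, then merge them with two pointers."""
    left = sorted(left, key=lambda row: row[key])
    right = sorted(right, key=lambda row: row[key])
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lv, rv = left[i][key], right[j][key]
        if lv < rv:
            i += 1          # advance the pointer with the smaller value
        elif lv > rv:
            j += 1
        else:
            # find the end of the run of equal keys on the right side
            j_end = j
            while j_end < len(right) and right[j_end][key] == lv:
                j_end += 1
            # emit every combination of rows sharing this key value
            while i < len(left) and left[i][key] == lv:
                for k in range(j, j_end):
                    result.append({**left[i], **right[k]})
                i += 1
            j = j_end
    return result
```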
    This approach relies on two things: first, the contents of the compared column must be sortable in some way.
    Second, since both tables need to be sorted first, an efficient sorting algorithm has to be used.

    Sorting the two tables with merge sort has a complexity of $O(M \log M + N \log N)$.
    This exceeds the complexity of the approach in Subsection~\ref{subsec:hash-join}.
    Still, this algorithm can be more efficient when the tables are already sorted.
    In that case, the complexity drops to the $O(M+N)$ required for walking through the tables.

    \section{Dataset Description}
    \label{sec:dataset-description}
    For the analysis, the WatDiv\cite{watdiv} dataset is used.
    It consists of the following entity types:
    \begin{itemize}
        \item wsdbm:User
        \item wsdbm:Product
        \item wsdbm:Review
    \end{itemize}

    They have the following relations:
    \begin{itemize}
        \item wsdbm:User wsdbm:follows wsdbm:User
        \item wsdbm:User wsdbm:friendOf wsdbm:User
        \item wsdbm:User wsdbm:likes wsdbm:Product
        \item wsdbm:Product rev:hasReview wsdbm:Review
    \end{itemize}

    \section{Experiment and Analysis}
    \label{sec:experiment-and-analysis}

    \subsection{Preparations}
    \label{subsec:preparations}
    To import the dataset with the Python library rdflib\cite{rdflib}, the file contents were converted to a URI format using the following bash command:

    \verb$sed -E 's/([a-zA-Z0-9]+:[^ \t]+)/<\1>/g'$

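For illustration, the same rewrite can be expressed with Python's \texttt{re} module. This is a sketch of what the sed substitution does (wrapping each prefixed term in angle brackets), not part of the project's code.

```python
import re

# Wrap prefixed terms such as wsdbm:User in angle brackets, mirroring the sed call
pattern = re.compile(r'([a-zA-Z0-9]+:[^ \t]+)')

line = "wsdbm:User wsdbm:likes wsdbm:Product"
print(pattern.sub(r'<\1>', line))
# <wsdbm:User> <wsdbm:likes> <wsdbm:Product>
```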
    \subsection{Implementation}
    \label{subsec:implementation}
    The algorithms were implemented in Python.
    It may not be the fastest option, but as an interpreted language Python is well suited for rapid development.
    The code (including this report) is published at \url{https://naclador.de/mosers/ADBIS-Projekt2}.

    \subsection{Analysis}
    \label{subsec:analysis}
    Due to a lack of time, an existing bug in the Sort Merge Join implementation could not be found and fixed.
    Therefore, the following results for the small dataset are not representative:

    \begin{verbatim}
        5.67576265335083s for Hash Join (11415461 items)
        0.15003275871276855s for Sort Merge Join (1475 items)
        6.041101694107056s for Nested Loop Join (11415461 items)
    \end{verbatim}

    As can be seen, the number of result items for the Sort Merge Join differs greatly from the other two.

    For smaller tables, the Hash Join is expected to be the fastest algorithm, since the hash table can be kept entirely in memory.
    For bigger tables, the Sort Merge Join will be faster, as it is not constrained by memory in the same way.

    \section{Conclusion}
    \label{sec:conclusion}
    The choice of join algorithm should depend on the use case.
    There is no single fastest algorithm: for small tables the Hash Join is very effective, while the Sort Merge Join is ideal for pre-sorted tables.
    The comparison with the Nested Loop Join shows that a sophisticated algorithm is usually better than the simplest solution.

    \begin{thebibliography}{watdiv}
        \bibitem{wikijoin}
        Join (SQL), Wikipedia, \url{https://en.wikipedia.org/wiki/Join_(SQL)}

        \bibitem{wikihash}
        Hash join, Wikipedia, \url{https://en.wikipedia.org/wiki/Hash_join}

        \bibitem{wikimergesort}
        Sort-merge join, Wikipedia, \url{https://en.wikipedia.org/wiki/Sort-merge_join}

        \bibitem{watdiv}
        Waterloo SPARQL Diversity Test Suite, University of Waterloo, \url{https://dsg.uwaterloo.ca/watdiv/}

        \bibitem{rdflib}
        RDFLib, \url{https://rdflib.readthedocs.io}
    \end{thebibliography}
\end{document}