From 29d3307ebd0bf97cb1a6b091458294419a4d05ad Mon Sep 17 00:00:00 2001 From: Saml Creedon Date: Wed, 8 May 2019 22:20:42 +0200 Subject: [PATCH] Presentation Shorter version of the presentation --- Presentation/Presentation (Shorter).tex | 342 ++++++++++++++++++++++++ 1 file changed, 342 insertions(+) create mode 100644 Presentation/Presentation (Shorter).tex diff --git a/Presentation/Presentation (Shorter).tex b/Presentation/Presentation (Shorter).tex new file mode 100644 index 0000000..766afa0 --- /dev/null +++ b/Presentation/Presentation (Shorter).tex @@ -0,0 +1,342 @@ +\documentclass{beamer} +%\usepackage{Default} + +\title{Graph, Algorithms, and Models} +\author{Andrea Civilini, Xinyi Xu, and Sam Creedon} +\date{Whenever} + +\begin{document} + +\begin{frame} \titlepage \end{frame} + +\begin{frame} +\frametitle{Overview} +\tableofcontents +\end{frame} + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\section{What is a Community?} +\subsection{Intuitive Definition} + +\begin{frame} +\frametitle{Intuitive Definition} + +\begin{itemize} + +\item Let $G=(V,E)$ be a graph, with $N=|V|$ and $K=|E|$. + +\item $G$ has a community structure if there is a partition $\{C_{1},C_{2},\dots,C_{k}\}$ of $V$ in which each $C_{i}$ has dense internal connectivity but sparse external connectivity. + +\item Such a definition is open to interpretation, so there is no universal definition of a community. + +\item Motivating Example: Social Networks. 
+ +\end{itemize} + +\end{frame} + +\subsection{Generalisations and Variations} + +\begin{frame} +\frametitle{Generalisations and Variations} + +\begin{itemize} +\item There are many related concepts in research on communities in networks: +\begin{itemize} +\item Communities in weighted or directed graphs +\item Overlapping communities +\item Hierarchical community structure +\item Evolution of community structure over time +\end{itemize} + +\item Our main focus will be on undirected, unweighted graphs. The community structures we examine will be partitions of the vertices. +\end{itemize} + +\end{frame} + +\subsection{Why study Communities} + +\begin{frame} +\frametitle{Why study Communities} + +\begin{itemize} +%\item Networks and their communities appear in numerous disciplines, ranging from natural and social sciences, to computer sciences and engineering. + +\item Community structure is a fundamental feature of real-world networks. + +\item Networks and their communities appear in many disciplines, from the natural and social sciences to computer science, engineering, and biology. + +\item Understanding the communities within a network allows us to: + +\begin{itemize} +\item Gain a better picture of the network as a whole + +\item Obtain local data on how the network operates + +\item Identify communities whose properties differ strongly from the average properties of the network. +\end{itemize} +\end{itemize} + +\end{frame} + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\section{Modularity} +\subsection{Modularity} + +\begin{frame} +\frametitle{Modularity} + +\begin{itemize} +\item Suppose we have a potential partition $\mathcal{P}=\{C_{1},C_{2},\dots,C_{k}\}$ of a network into communities. 
+ +\item A quality function assigns a numerical value to each partition, with the intention of describing how well the partition groups the network into communities. + +\item The most popular quality function is called \emph{Modularity}. It is defined by + +\[ Q_{\mathcal{P}} = \frac{1}{2K}\sum_{i=1}^{N}\sum_{j=1}^{N} \left(a_{ij} - \frac{k_{i}k_{j}}{2K} \right) \delta(C_{i},C_{j}). \] + +\item Here $a_{ij}$ are the elements of the adjacency matrix, $k_{i}$ is the degree of node $i$, $C_{i}$ denotes the community containing node $i$, and $\delta$ is the Kronecker delta function. +\end{itemize} + +\end{frame} + +\begin{frame} +\frametitle{Modularity} + +\begin{itemize} +\item \emph{Modularity} measures the difference between the internal density of each community and the density expected in a random network. + +\item The random network most commonly used is one where nodes are connected uniformly at random, but with the same number of nodes, and the same degree for each node, as the original network. + +\item \emph{Modularity} is comparable to statistical significance testing, as it gives us a measure of how different our network is from a null model with respect to community structure. + +\item Positive modularity suggests a ``good'' partition, and the higher the better. +\end{itemize} + +\end{frame} + +\subsection{Checking all Partitions of a Graph} + +\begin{frame} +\frametitle{Checking all Partitions of a Graph} + +\begin{itemize} +\item A natural question now is how one would go about finding a partition of a network into communities. + +\item Simply enumerating all partitions of the vertices of a network is highly impractical. + +\item The number of partitions of a set of size $n$ is given by $B_{n}$, the $n^{\text{th}}$ Bell number, with $B_{0}=1$. One can show the following: + +\[ B_{n} = \sum\limits_{k=0}^{n-1}\begin{pmatrix} +n-1 \\ +k +\end{pmatrix}B_{k} \geq \sum\limits_{k=0}^{n-1}\begin{pmatrix} +n-1 \\ +k +\end{pmatrix} = 2^{n-1}. 
\] + +\item This shows that $B_{n}$ grows at least exponentially, with $2^{n}$ as an asymptotic lower bound. + +\item Thus enumerating all partitions of the vertex set of a network is practically impossible, except when the network has a very low node count. +\end{itemize} + +\end{frame} + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\section{Algorithms for finding Communities} +\subsection{Girvan-Newman Method} + +\begin{frame} +\frametitle{Girvan-Newman Method} + +\begin{itemize} +\item The Girvan-Newman algorithm was one of the first algorithms created to find community structure within a network. + +\item The Idea: +\begin{itemize} +\item Edges between communities are sparse. +\item We expect many shortest paths within the network to pass over these ``bridges''. +\item Finding and removing such edges decomposes the network into connected components, which reveal the communities. +\end{itemize} +\end{itemize} + +\end{frame} + +\begin{frame} +\frametitle{Girvan-Newman Method} + +\begin{itemize} +\item To find such edges, we assign a value to each edge referred to as the \emph{betweenness} of the edge. It is the number of shortest paths passing through the given edge divided by the total number of shortest paths. + +\item The edge \emph{betweenness} measures the importance of the edge in traversing the network. As such, we expect the few edges which connect the communities together to have high \emph{betweenness}. 
+ +\item The Girvan-Newman algorithm is then given as follows: +\begin{itemize} +\item[(1)] Evaluate the betweenness of each edge in the network +\item[(2)] Remove the edge of maximal betweenness (chosen arbitrarily if ties occur) +\item[(3)] Re-evaluate the betweenness of each edge in the new network obtained from the above step +\item[(4)] Repeat steps 2 and 3 above until no more edges remain +\end{itemize} + +\end{itemize} + +\end{frame} + +\begin{frame} +\frametitle{Girvan-Newman Method} + +\begin{itemize} +\item This algorithm produces a sequence of graphs whose nested connected components form a dendrogram. + +\item The connected components of each such graph correspond to a partition of the original network into communities, and using modularity one can find the best candidates. + +\item Remarks: +\begin{itemize} +\item Has a rather high time complexity of $\mathcal{O}(K^{2}N)$ for unweighted graphs. +\item Has natural generalisations to directed graphs. +\item Not as easy to generalise to covers (``overlapping partitions''), but such modified versions do exist. +\item Since we obtain a dendrogram, we obtain hierarchical knowledge of the community structure. +\end{itemize} + +\end{itemize} + +\end{frame} + +\subsection{WalkTrap Algorithm} + +\begin{frame} +\frametitle{WalkTrap Algorithm} + +\begin{itemize} +\item Taking a random walk in a network, one would expect to be trapped within densely connected areas. + +\item WalkTrap tries to exploit this behaviour to identify which collections of nodes are likely to be members of the same community. + +\item An important ingredient in this algorithm is the \emph{transition matrix} $P$, whose $ij$-th entry gives the probability of travelling from node $i$ to node $j$ in a single step. + +\item For $t \in \mathbb{N}$, the matrix $P^{t}$ describes the probabilities of going from one node to another in $t$ steps. +\end{itemize} + +\end{frame} + +\begin{frame} +\frametitle{WalkTrap} + +\begin{itemize} +\item Fix $t\in \mathbb{N}$. 
One uses the matrix $P^{t}$ to define a distance between the nodes of the network. + +\item Nodes of the same community are considered ``close'', while nodes of distinct communities are considered ``far away''. + +\item WalkTrap algorithm: Start with $\mathcal{P}_{1} = \{\{v\} \mid v \in V\}$. Compute the distances between all adjacent nodes. The partition evolves by repeating the following operations. At each stage $k$ with partition $\mathcal{P}_{k}$: +\begin{itemize} +\item Choose $C_{1}$ and $C_{2}$ of $\mathcal{P}_{k}$ which satisfy some minimality condition based on the distance function. +\item Merge these two communities, i.e. $C_{3}=C_{1}\cup C_{2}$, and create the new partition $\mathcal{P}_{k+1} = (\mathcal{P}_{k}\setminus \{C_{1},C_{2}\})\cup \{C_{3}\}$ +\item Update the distances between the communities and repeat +\end{itemize} +\end{itemize} + +\end{frame} + +\begin{frame} +\frametitle{WalkTrap} + +\begin{itemize} +\item We obtain a collection of partitions of the network which form a dendrogram. + +\item We start from the finest partition, and end with the coarsest. + +\item The partitions obtained along the way are chosen in a greedy fashion, trying to minimise some distance condition between the communities. + +% Below summarises the conclusion given in the source +\item Remarks: +\begin{itemize} +\item At worst its time complexity is $\mathcal{O}(KN^{2})$. For sparse networks this can be improved to $\mathcal{O}(N^{2}\log(N))$. +\item Hierarchical community structure is obtained via the dendrogram. +\item Many of the proofs and computations do not carry over to directed graphs. +\item If the graph is very large, instead of computing $P^{t}$ directly, one can obtain an approximation by sampling many random walks on the graph. 
+\end{itemize} +\end{itemize} + +\end{frame} + + +\subsection{The Label Propagation algorithm} + + +\begin{frame} +\frametitle{The Label Propagation algorithm} + +\begin{itemize} +\item The idea behind the Label Propagation algorithm is that a node is far more influenced by its community than by its external neighbours. + +\item The algorithm goes as follows: +\begin{itemize} +\item[(1)] Give every node a distinct label +\item[(2)] At every iteration of label propagation, each node updates its label to the one that is most popular among its neighbours (ties are broken uniformly at random). +\item[(3)] The process terminates when each node has the majority label of its neighbours. +\end{itemize} +\end{itemize} + +\end{frame} + +\begin{frame} +\frametitle{The Label Propagation algorithm} + +\begin{itemize} +\item As labels move around, one would expect densely connected groups of nodes to reach a consensus on a unique label, while labels would have trouble crossing sparsely connected areas. + +\item Thus at the end of the algorithm nodes that have the same label are said to belong to the same community. + +\item Remarks: +\begin{itemize} +\item The algorithm can be semi-supervised by pre-assigning labels to nodes. This gives great flexibility in its use. +\item The time complexity is $\mathcal{O}(EK\log(N))$, where $E$ is the number of iterations of the process. +\item May get trapped in loops, but there are means to remedy this issue. +\end{itemize} +\end{itemize} + +\end{frame} + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\section{Validation of Community finding Algorithms} +\subsection{Benchmark Networks} + +\begin{frame} +\frametitle{Benchmark Networks} + +\begin{itemize} +\item The algorithms we have discussed are constructed with intuitive properties of what it means to be a community in mind. + +\item However, we have not yet mentioned whether the algorithms are indeed good at their job. 
+ +\item The main reason for this is that, at the moment, there is no known best way of validating an algorithm for finding communities. This mainly comes down to the fact that there is no unifying definition of a community. + +\item The most popular way of testing an algorithm and comparing it to others is by running it on benchmark networks whose underlying community structure is known, and comparing the known structure to the one obtained by the algorithm. +\end{itemize} + +\end{frame} + +\begin{frame} +\frametitle{Benchmark Networks} + +\begin{itemize} +\item There are both computer-generated benchmark networks and real-world ones. + +\item The most used benchmark networks vary the community structure in precise ways to try to isolate certain difficulties and aspects that one would want a community-finding algorithm to overcome if it is to be considered good. + +\item Many popular benchmark models are inspired by the Stochastic Block Model. The idea is that we pre-group the nodes, and join two given nodes by an edge with a probability depending on the groups to which the nodes belong. + +\item A simple case would be to have two probabilities, $p_{in}$ for pairs of nodes within the same group, and $p_{out}$ for pairs of nodes in distinct groups, with $p_{in}>p_{out}$. + +\item Such a random graph would thus be built with a community structure in mind, and the greater the difference between $p_{in}$ and $p_{out}$, the easier we would expect the communities to be to detect. +\end{itemize} + +\end{frame} + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\end{document}