2022/7/23 20:55 COMP9024 22T2 – Assignment https://cgi.cse.unsw.edu.au/~cs9024/22T2/assn/index.php

COMP9024 22T2 Assignment
MyPageRank
Data Structures and Algorithms

Change Log

We may make minor changes to the spec to address/clarify some outstanding issues. These may require minimal changes in your design/code, if at all. Students are strongly encouraged to check the online forum discussion and the change log regularly.

Version 0.1 (2022-07-11 15:00) Initial release.
Version 0.2 (2022-07-14 13:00) Assignment Files: added micro-web link/image.
Version 0.3 (2022-07-14 17:00) Subset 0: explicitly noted that string.h may be used and that memory for list/graph structures should be allocated dynamically. Assessment: added submission and due date details.
Version 0.4 (2022-07-14 18:15) Subset 3: in the crawler.c code stub, replaced the fgets and strlen lines with a call to scanf.
Version 0.5 (2022-07-21 02:15) Subsets 1, 2: added sample output for the micro-web, nothing to ignore.
Version 0.6 (2022-07-21 04:15) Subsets 1, 2: added testing.
Version 0.7 (2022-07-21 13:50) Subset 3: added sample output for the micro-web, nothing to ignore; added testing.
Version 1.0 (2022-07-21 22:00) Subsets 1, 2, 3: added sample output and dryrun test cases with ignore lists.

Assignment Files

Download this zip file. Unzipping the file will create a directory called 'files' with all the assignment start-up files. You can also make note of the following URLs:

http://www.cse.unsw.edu.au/~cs9024/micro-web
http://www.cse.unsw.edu.au/~cs9024/mini-web

Here is a visual representation of the micro-web:

Once
you read the spec, hopefully it will be clear to you what you could use
these URLs for! You may also find it useful to construct a similar
graph visualisation for the mini-web.

Background

As we have mentioned in lectures, the Internet can be thought of as a graph (a very large graph). Web pages represent vertices and hyperlinks represent directed edges. With over 1.2 billion unique websites (as of June 2021), each website having multiple webpages, and each webpage having multiple hyperlinks, it is understandably very difficult to remember the URL of every website you want to visit.

To make life easier, there have been search engines since the very early days of the internet that can be used to find websites. But the job of a search engine is very difficult: first it must index (create a representation of) the entire internet, or as close to it as possible; next it must rank the webpages it finds. In this assignment we will be implementing algorithms to solve each of these problems:

1. To index the internet we will create a web crawler.
2. To rank webpages we will implement the PageRank algorithm.
3. To find the shortest path between two pages we will implement Dijkstra's algorithm.

The Assignment

Subset 0 – Dependencies

Before
we can start crawling, we need to be able to store our crawled data. As the internet is a graph, this means we need a Graph ADT. We will also need a Set ADT and either a Queue ADT or a Stack ADT in order to perform web scraping (for a BFS or DFS, respectively).

Subset 0a – Implement List (Queue, Stack, Set) ADT

Task

You have been provided with a file list.h. Examine the given list.h file carefully. It provides the interface to an ADT that will provide Queue, Stack, and Set functionality.

You should create the file list.c to implement this ADT. Your list ADT should store string (char *) data. You may use the string.h library and other standard libraries from the weekly exercises. You are required to dynamically allocate memory in your ADT for full style marks. You can use whatever internal representation you like for your list ADT. You can reuse the code previously submitted for weekly assessments in your list ADT.

As a reminder:

Queue – First In, First Out
Stack – First In, Last Out
Set – only stores unique values

See list.h for more information about each function that you are required to implement. Remember that you cannot modify list.h. You should write programs that use the list ADT to test and debug your code.

Subset 0b – Implement Graph ADT

Task

You
have been provided with a file graph.h. Examine the given graph.h file carefully. It provides the interface to an ADT that will provide Graph functionality.

You should create the file graph.c to implement this ADT. Your graph ADT should store string (char *) data in its vertices. You may use the string.h library and other standard libraries from the weekly exercises. You are required to dynamically allocate memory in your ADT for full style marks. You must use an adjacency list representation for your ADT, but the exact implementation is up to you. You can reuse the code previously submitted for weekly assessments in your graph ADT.

See graph.h for more information about each function that you are required to implement. Note that graph.h indicates that each edge has a weight. Remember that you cannot modify graph.h. You should write programs that use the graph ADT to test and debug your code.

Subset 1 – Web Crawler

Task

We are now going to use the list and graph ADTs you have created to implement a web scraper. After
compiling with make crawler (using the given Makefile), run the executable and check that the output aligns with the navigation of the sample website. Once you have completed the appropriate list functions, carefully examine the code in crawler.c. Uncomment
the block of code that uses scanf to take user input for the 'ignore list', so that users can provide some URLs as input. The ignore list represents the URLs that we would like to completely ignore when calculating page ranks, as if they did not exist in the graph. This means that any incoming and outgoing edges of these URLs are treated as non-existent. You are required to implement this functionality locally – within the graph_show() function – and NOT to change the actual graph ADT. For more details see the graph.h file.

crawler.c requires external dependencies (libcurl and libxml2). The provided Makefile will work on CSE servers (ssh and vlab) but might not work on your home computer.

If you have correctly implemented the ADTs from the previous tasks, this part should be mostly free. Note:
crawler.c is a complete implementation of a web crawler; you do not
need to modify the utility functions, only the bottom part of the main
function. However, you should look at the program carefully and
understand it well so that you can use it (i.e., modify it
appropriately) for later tasks.

Sample Output

Using a modified
crawler.c that simply calls graph_show() on the micro-web, and without ignoring any pages, the output should be:

prompt$ ./crawler http://www.cse.unsw.edu.au/~cs9024/micro-web/index.html
Enter a page to ignore or type 'done': done
http://www.cse.unsw.edu.au/~cs9024/micro-web/index.html
http://www.cse.unsw.edu.au/~cs9024/micro-web/X.html
http://www.cse.unsw.edu.au/~cs9024/micro-web/Y.html
http://www.cse.unsw.edu.au/~cs9024/micro-web/Z.html
http://www.cse.unsw.edu.au/~cs9024/micro-web/index.html http://www.cse.unsw.edu.au/~cs9024/micro-web/X.html 1
http://www.cse.unsw.edu.au/~cs9024/micro-web/index.html http://www.cse.unsw.edu.au/~cs9024/micro-web/Y.html 1
http://www.cse.unsw.edu.au/~cs9024/micro-web/index.html http://www.cse.unsw.edu.au/~cs9024/micro-web/Z.html 1
http://www.cse.unsw.edu.au/~cs9024/micro-web/X.html http://www.cse.unsw.edu.au/~cs9024/micro-web/index.html 1
http://www.cse.unsw.edu.au/~cs9024/micro-web/Y.html http://www.cse.unsw.edu.au/~cs9024/micro-web/index.html 1

Now let's add index.html to the ignore list:

prompt$ ./crawler http://www.cse.unsw.edu.au/~cs9024/micro-web/index.html
Enter a page to ignore or type 'done': http://www.cse.unsw.edu.au/~cs9024/micro-web/index.html
Enter another page to ignore or type 'done': done
http://www.cse.unsw.edu.au/~cs9024/micro-web/X.html
http://www.cse.unsw.edu.au/~cs9024/micro-web/Y.html
http://www.cse.unsw.edu.au/~cs9024/micro-web/Z.html

All
traces of index.html have been removed. Since there are no longer any edges, only the remaining vertices are displayed. Note
that the order of the output matters. It should follow the BFS that
is performed by the crawler. If your result does not follow this order,
you will be marked as incorrect, even if your graph is valid.

Testing

We
have created a script to automatically test your list and graph ADTs.
It expects to find list.c and graph.c in the current working directory.
Limited test cases are provided, so you should always do your own, more thorough, testing.

prompt$ 9024 dryrun assn_crawler

Subset 2 – PageRank

Background

Now
that we can crawl a network and build a graph, we need a way to
determine what pages in our network (vertices) are important. We
haven’t kept page content so the only metric we can use to determine
importance of a page is to check how much other pages rely on its
existence. That is: how easy is it to follow a sequence of one or more links (edges) and end up on the page? In 1998 Larry Page and Sergey Brin created the PageRank algorithm to evaluate this metric. Google still uses the PageRank algorithm to score every page it indexes on the internet to help order its search results.

Task

In graph.c implement the two new functions graph_pagerank() and graph_viewrank(). First,
graph_pagerank() should calculate the PageRank of every page (vertex)
in your graph. The algorithm should store and calculate the PageRank in
the ADT. The algorithm must exclude the URLs that are provided
in an ‘ignore list’ to the function. Do not remove the pages from the
graph, only skip (i.e., ignore) them for calculation. This means that you will need to understand which parts of the PageRank algorithm need to be modified. Using
the ignore list, you will be able to see what happens to the page ranks
as certain pages are removed. What should happen to the PageRank of a page if you remove all of the pages pointing to it?

Second, graph_viewrank() should print the PageRank of every page (vertex) in your graph that is NOT in the ignore list. Pages (vertices) should be printed from highest to lowest rank; if two pages have the same rank then they should be printed alphabetically.

You can add utility functions to graph.c.
You can (and most likely will need to) modify your struct definitions in graph.c.
You cannot modify the file graph.h.
You cannot modify the function prototypes for graph_pagerank() and graph_viewrank().

Algorithm

For $t = 0$:

$$PR(p_i; t) = \frac{1}{N}$$

For $t > 0$:

$$PR(p_i; t) = \frac{1 - d}{N} + d \times \left( \left( \sum_{p_j \in M(p_i)} \frac{PR(p_j; t-1)}{D(p_j)} \right) + \left( \sum_{p_j \in S} \frac{PR(p_j; t-1)}{N} \right) \right)$$

Where:

$N$ is the number of vertices, and $p_i$ and $p_j$ are each some vertex
$t$ is the "time" (iteration count)
$PR(p_i; t)$ is the PageRank of vertex $p_i$ at time $t$
$d$ is the damping_factor
$M(p_i)$ is the set of vertices that have an outbound edge towards $p_i$
$PR(p_j; t-1)$ is the PageRank of vertex $p_j$ at time $t-1$
$D(p_j)$ is the degree of vertex $p_j$, i.e. the number of outbound edges of vertex $p_j$
$S$ is the set of sinks, i.e. the set of vertices with no outbound edges, i.e. where $D(p_j)$ is 0

This formula is equivalent to the following algorithm:

procedure graph_pagerank(G, damping_factor, epsilon)
    N = number of vertices in G
    for all V in vertices of G
        oldrank of V = 0
        pagerank of V = 1 / N
    end for
    while |pagerank of V - oldrank of V| of any V in vertices of G > epsilon
        for all V in vertices of G
            oldrank of V = pagerank of V
        end for
        sink_rank = 0
        for all V in vertices of G that have no outbound edges
            sink_rank = sink_rank + (damping_factor * (oldrank of V / N))
        end for
        for all V in vertices of G
            pagerank of V = sink_rank + ((1 - damping_factor) / N)
            for all I in vertices of G that have an edge from I to V
                pagerank of V = pagerank of V + ((damping_factor * oldrank of I) / number of outbound edges from I)
            end for
        end for
    end while
end procedure

In
order to test your pagerank functions, you should modify crawler.c to
#include "pagerank.h", and change the last part of the main function to something like:

...
graph_show(network, stdout, ignore_list);

graph_pagerank(network, damping, delta, ignore_list);
graph_viewrank(network, stdout, ignore_list);

list_destroy(ignore_list);
graph_destroy(network);

where you choose appropriate values for damping and delta (aka epsilon). Again
it is noted that the changes you make to crawler.c are purely for you
to test whether your pagerank functions are working. We use a different
crawler.c for testing your pagerank functions.

Sample Output

Here we're using a modified crawler.c that calculates graph_pagerank() and prints graph_viewrank(). Damping has been set to 0.85 and epsilon/delta to 0.00001. For the micro-web, and without ignoring any pages, the output should be:

prompt$ ./crawler http://www.cse.unsw.edu.au/~cs9024/micro-web/index.html
Enter a page to ignore or type 'done': done
http://www.cse.unsw.edu.au/~cs9024/micro-web/index.html (0.412)
http://www.cse.unsw.edu.au/~cs9024/micro-web/X.html (0.196)
http://www.cse.unsw.edu.au/~cs9024/micro-web/Y.html (0.196)
http://www.cse.unsw.edu.au/~cs9024/micro-web/Z.html (0.196)

Now let's add index.html to the ignore list:

prompt$ ./crawler http://www.cse.unsw.edu.au/~cs9024/micro-web/index.html
Enter a page to ignore or type 'done': http://www.cse.unsw.edu.au/~cs9024/micro-web/index.html
Enter another page to ignore or type 'done': done
http://www.cse.unsw.edu.au/~cs9024/micro-web/X.html (0.333)
http://www.cse.unsw.edu.au/~cs9024/micro-web/Y.html (0.333)
http://www.cse.unsw.edu.au/~cs9024/micro-web/Z.html (0.333)

X.html,
Y.html and Z.html have no connections anymore and as such have the same
ranks. Note that the sum is still equal to 1, and N, the number of
vertices, is equal to 3 in this case, since there were a total of 4 nodes originally and 1 of them has been ignored.

Testing

We
have created a script to automatically test your PageRank algorithm. It
expects to find list.c and graph.c in the current working directory.
Limited test cases are provided, so you should always do your own, more thorough, testing.

prompt$ 9024 dryrun assn_rankings

Subset 3 – Degrees of Separation (Shortest Path)

Task

In graph.c implement the two new functions graph_shortest_path() and graph_view_path(). First, graph_shortest_path() should calculate the shortest path between a source vertex and all other vertices, using Dijkstra's algorithm to do so. Note
that an ignore list is also passed into graph_shortest_path(). Similar
to above, you will need to ensure these URLs are treated as
non-existent. For example, if there was a path A->B->C, but B is ignored, then there is no path from A to C.

Unlike a regular implementation of Dijkstra's algorithm, your code should minimise the number of edges in the path (not minimise the total weight of the path).

Second, graph_view_path() should print the path from the previously given source vertex to a given destination vertex. With the ignore list, there may be no path between two vertices; in this case, output nothing.

You can add more utility functions to graph.c.
You can (and most likely will need to) extend your struct definitions in graph.c.
You cannot modify the file graph.h.
You cannot modify the function prototypes for graph_shortest_path() and graph_view_path().

In
order to test your dijkstra functions, you should modify crawler.c to
add #include "dijkstra.h", and change the last part of the main function to something like:

...
graph_show(network, stdout, ignore_list);

graph_shortest_path(network, argv[1], ignore_list);

char destination[BUFSIZ];
printf("destination: ");
scanf("%s", destination);
graph_view_path(network, destination, ignore_list);

list_destroy(ignore_list);
graph_destroy(network);

The
changes you make to crawler.c are purely for you to test whether your
dijkstra functions are working. We use a different crawler.c for testing
your dijkstra functions.

Sample Output

Using a modified crawler.c that accepts a source page as a command-line argument from which to calculate graph_shortest_path(), and a destination page for which to output graph_view_path(), for the micro-web, and without ignoring any pages, the output when tracing a path from X.html to Z.html should be:

prompt$ ./crawler http://www.cse.unsw.edu.au/~cs9024/micro-web/index.html http://www.cse.unsw.edu.au/~cs9024/micro-web/X
destination: http://www.cse.unsw.edu.au/~cs9024/micro-web/Z.html
Enter a page to ignore or type 'done': done
http://www.cse.unsw.edu.au/~cs9024/micro-web/X.html
http://www.cse.unsw.edu.au/~cs9024/micro-web/index.html
http://www.cse.unsw.edu.au/~cs9024/micro-web/Z.html

Now let's add index.html to the ignore list:

prompt$ ./crawler http://www.cse.unsw.edu.au/~cs9024/micro-web/index.html http://www.cse.unsw.edu.au/~cs9024/micro-web/X
destination: http://www.cse.unsw.edu.au/~cs9024/micro-web/Z.html
Enter a page to ignore or type 'done': http://www.cse.unsw.edu.au/~cs9024/micro-web/index.html
Enter another page to ignore or type 'done': done
prompt$

Since
index.html has been ignored, the path cannot be completed and nothing
is returned. Your algorithm should iterate vertices/pages in the same
order as the crawler. This ensures that when your algorithm finds
the shortest path, it will return the first path it would encounter from
the BFS in the crawler. If your result does not follow this order, you will be marked as incorrect, even if your path is valid.

Testing

We
have created a script to automatically test your shortest path
algorithm. It expects to find list.c and graph.c in the current working
directory. Limited test cases are provided, so you should always do your own, more thorough, testing.

prompt$ 9024 dryrun assn_path

Assessment

Submission

Submit your list.c and graph.c files using the following give command:

prompt$ give cs9024 assn list.c graph.c

Make sure you spell the filenames correctly. You can run give multiple times. Only your last submission will be marked. You can check your latest submission on CSE servers with:

prompt$ 9024 classrun check assn

If
you are working at home, you may find it more convenient to upload your
work via the “Make Submission” tab at the top of this assignment page
on WebCMS, or via give's web interface.

Due Date

This assignment is due Week 10, Wednesday, 3 August 23:59:59 (i.e. before midnight).

The UNSW standard late penalty for assessment is 5% per day for 5 days; this is implemented hourly for this assignment. Each hour your assignment is submitted late reduces its mark by 0.2%. For example, if an assignment marked at 60% was submitted 10 hours late, it would be awarded 58.8%. Beware – submissions more than 5 days late will receive zero marks. This again is the UNSW standard assessment policy.

Assessment Scheme

Total 12 marks allocated for this assignment:

2
marks for quality/correctness of the code against the assignment specification and coding style (readability, comments, etc.) … manual inspection of the code by the tutors
6.5 marks for page ranking implementation … automark/testing
3.5 marks for shortest path implementation … automark/testing

Important: Any
submission that does not allow us to follow the above-mentioned marking
procedure “normally” (e.g., missing files, compile errors) will result
in your submission not getting marked in time. Depending on the
severity of the errors/problems, we may ask you to resubmit (with max
late penalty) or assess your written code instead (e.g., for some "effort" marks only). Ensure your submitted code compiles on a CSE machine using dcc with the standard options -Wall -Werror -std=c11.

Plagiarism

Group
submissions will not be allowed. Your program must be entirely your own
work. Plagiarism detection software will be used to compare all
submissions pairwise (including submissions for similar assignments
in previous years, if applicable) and serious penalties will be applied,
particularly in the case of repeat offences.

Do not copy ideas or code from others. Do not use a publicly accessible repository or allow anyone to see your code.

Please refer to the online resources to help you understand what plagiarism is and how it is dealt with at UNSW:

Academic Integrity and Plagiarism
UNSW Plagiarism Policy Statement
UNSW Plagiarism Management Procedure

Copyright

Reproducing,
publishing, posting, distributing or translating this assignment is an
infringement of copyright and will be referred to UNSW Student Conduct
and Integrity for action.