COMP3220 Document Processing and Semantic Technologies
Assignment 3: Specification
RDF Knowledge Graph Construction
Change log:
[2022-04-29] Updated the zip folder: assignment-3-start.zip.
1. Introduction
For this assignment you have to retrieve a number of simple sentences from an existing web page, extract the relevant information from these sentences and transform this information into an RDF knowledge graph with the help of the RDF Mapping Language (RML). Once this RDF knowledge graph is available, you have to display it in Turtle notation. You also have to write a number of SPARQL queries of your choice that extract information from the knowledge graph. Finally, you have to produce a 3-minute video that explains the details of your Python implementation and how you built the RML mapping rules.
Please download the folder “assignment-3-start.zip” to start with this assignment. This folder contains a version of SDM-RDFizer, a configuration file “config.ini” for the RDFizer, an incomplete file “mapping.ttl” for the RML mapping rules to be added, and an HTML file “student.html” that contains the information to be extracted.
2. Extracting Information
Fetch the web page “student.html” via an HTTP request.
To do this, open a command prompt, go to the folder where the HTML file “student.html” is located, and start a simple Python HTTP server from the command line:
C:>python -m http.server 8080
The Python program “student.py” should then request the HTML document (“student.html”) from this server via:
http://localhost:8080/student.html
You have to use the Python “requests” module for this task.
Afterwards, use the BeautifulSoup4 library to extract the raw text from the HTML file “student.html” and spaCy to extract the sentences from the text, and store these sentences in a list:
['Kevin Walker is a student.',
 'Kevin Walker was born on 2001/07/24.',
 'Kevin Walker lives in Epping.',
 'Kevin Walker works at ALDI.',
 'Kevin is a friend of Alice Miller.',
 'Kevin is studying at Macquarie University.',
 'He has the student number 40048822.',
 'He is enrolled in COMP3100 and in COMP3220.',
 'Alice Miller is an alumna of Macquarie University.',
 'She is a friend of him.']
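The fetch, text extraction, and sentence splitting steps described above could be sketched as follows. This is only a minimal sketch under several assumptions: the helper names (fetch_html, html_to_text, split_sentences) are hypothetical, spaCy v3 is installed, and a blank English pipeline with the rule-based "sentencizer" component stands in for a trained pipeline. The inline sample HTML is made up, since the real structure of “student.html” is not shown here.

```python
import requests
import spacy
from bs4 import BeautifulSoup

# URL of the page once "python -m http.server 8080" is running.
URL = "http://localhost:8080/student.html"


def fetch_html(url=URL):
    """Request the page with the requests module and return its HTML."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def html_to_text(html):
    """Strip all markup and return the visible text as a single string."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator=" ", strip=True)


def split_sentences(text):
    """Split raw text into a list of sentence strings with spaCy."""
    # Assumption: rule-based splitting on punctuation is sufficient here.
    # A trained pipeline (for example en_core_web_sm) would be needed for
    # the later subject/predicate/object extraction.
    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer")
    return [sent.text.strip() for sent in nlp(text).sents]


# Made-up sample input so the sketch runs without a server.
sample_html = (
    "<html><body><p>Kevin Walker is a student. "
    "Kevin Walker lives in Epping.</p></body></html>"
)
print(split_sentences(html_to_text(sample_html)))
```

With the local server running, calling split_sentences(html_to_text(fetch_html())) would process the real page instead of the sample string.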
For each of these ten sentences, extract the subject (source), the predicate (edge), and the object(s) (target), and store the resulting information in a pandas DataFrame in exactly the same way as shown below. Again, you have to use spaCy to extract the relevant information from these sentences. It is up to you whether you use the linguistic features that are available as token attributes in spaCy, spaCy’s matcher engine, or a combination of both. Note that you also have to resolve anaphoric expressions during this extraction process (for example: him –> he –> Kevin –> Kevin Walker) and normalise these expressions (as illustrated below).
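The resolution chain him –> he –> Kevin –> Kevin Walker and the date normalisation could be sketched as follows. This is a deliberately simplified illustration: the LINKS table is hand-written and hypothetical, whereas in an actual solution the links between mentions would have to be derived from the spaCy analysis of the sentences. The function names are assumptions as well.

```python
from datetime import datetime

# Hypothetical, hand-written links between mentions (keys are lower-case).
# Each mention points to the expression it refers back to.
LINKS = {
    "him": "he",
    "he": "Kevin",
    "kevin": "Kevin Walker",
    "she": "Alice Miller",
}


def resolve_mention(mention, links=LINKS):
    """Follow the chain of links until a fully resolved name is reached."""
    current = mention.strip()
    seen = set()
    while current.lower() in links and current.lower() not in seen:
        seen.add(current.lower())  # guard against accidental cycles
        current = links[current.lower()]
    return current


def normalise_date(text):
    """Convert a date written as YYYY/MM/DD into the form YYYY-MM-DD."""
    return datetime.strptime(text, "%Y/%m/%d").date().isoformat()


print(resolve_mention("him"))        # Kevin Walker
print(normalise_date("2001/07/24"))  # 2001-07-24
```

Mentions that have no entry in the table, such as place names, are returned unchanged.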
3. Adding RML Mapping Rules
Take the DataFrame and translate the information in this DataFrame into suitable csv files that serve as data sources for the RML mapping document “mapping.ttl”. Note that the file “mapping.ttl” initially contains only the prefixes of the IRIs for the N-Triples:
@base .
@prefix foaf: .
@prefix xsd: .
@prefix schema: .
@prefix rr: .
@prefix rml: .
@prefix ql: .
@prefix rdf: .
Add RML mapping rules to the file “mapping.ttl” that transform the information in the csv files into N-Triples notation. Once the RML mapping rules are defined, you can launch the transformation in the following way from your Python program:
import os
os.system("python -m rdfizer -c ./config.ini")
Note that you may also have to install the following two modules in order to run the rdfizer:
pip install mysql-connector-python
pip install psycopg2
If the transformation was successful, then the file “triples.nt” will contain the following N-Triples:
. “2001-07-24″^^. . . . . “40048822”^^. . . .
You can visualise the resulting N-Triples as a connected graph. You don’t have to generate this graphical representation for this assignment, but the graph may help you to inspect the triples while you develop the mapping rules. I used the online RDF Grapher for this purpose.
    source        edge                target
0   Kevin Walker  is a                student
1   Kevin Walker  born on             2001-07-24
2   Kevin Walker  lives in            Epping
3   Kevin Walker  works at            ALDI
4   Kevin Walker  is friend of        Alice Miller
5   Kevin Walker  is studying at      Macquarie University
6   Kevin Walker  has student number  40048822
7   Kevin Walker  is enrolled in      COMP3100
8   Kevin Walker  is enrolled in      COMP3220
9   Alice Miller  is alumna of        Macquarie University
10  Alice Miller  is friend of        Kevin Walker
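One possible way to turn such a DataFrame into csv data sources is sketched below. The choice of csv files is left open by the assignment, so writing one file per edge label, the file naming scheme, and the function name export_edges are all assumptions made for illustration. Only a few rows of the DataFrame are reproduced to keep the sketch short.

```python
import re
import tempfile
from pathlib import Path

import pandas as pd

# A few rows in the source/edge/target layout of the DataFrame.
rows = [
    ("Kevin Walker", "is a", "student"),
    ("Kevin Walker", "is enrolled in", "COMP3100"),
    ("Kevin Walker", "is enrolled in", "COMP3220"),
    ("Alice Miller", "is friend of", "Kevin Walker"),
]
df = pd.DataFrame(rows, columns=["source", "edge", "target"])


def export_edges(frame, out_dir):
    """Write one csv file per edge label and return the created paths."""
    out_dir = Path(out_dir)
    paths = []
    for edge, group in frame.groupby("edge"):
        # Hypothetical naming scheme: "is enrolled in" -> "is_enrolled_in.csv"
        name = re.sub(r"\W+", "_", edge).strip("_") + ".csv"
        path = out_dir / name
        group[["source", "target"]].to_csv(path, index=False)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    for path in export_edges(df, tmp):
        print(path.name, "->", path.read_text().splitlines())
```

Keeping one file per edge label means each RML mapping rule can read from a file with a fixed pair of columns, but a single csv file with all three columns would be an equally valid choice.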
4. Displaying N-Triples in Turtle Notation
Use Python’s rdflib library to read the file “triples.nt” and display the triples of the knowledge graph in Turtle notation. The output should look as follows:
@prefix foaf: .
@prefix ns1: .
@prefix xsd: .
ns1:alumniOf ;
foaf:knows .
a ;
ns1:addressLocality ;
ns1:birthDate "2001-07-24"^^xsd:date ;
ns1:courseCode ,
;
ns1:identifier "40048822"^^xsd:positiveInteger ;
ns1:study ;
ns1:workLocation ;
foaf:knows .
5. Querying the RDF Knowledge Graph
Translate the following six questions into SPARQL queries and answer these queries over the RDF knowledge graph. Use JSON notation to display the answer for each question as illustrated below:
Who is a student?
[{'who': 'Kevin%20Walker'}]
When was Kevin Walker born?
[{'when': '2001-07-24'}]
Where does Kevin Walker live and work?
[{'where_addr': 'Epping', 'where_loc': 'ALDI'}]
Who is a friend of whom?
[{'who': 'Kevin%20Walker', 'whom': 'Alice%20Miller'}, {'who': 'Alice%20Miller', 'whom': 'Kevin%20Walker'}]
Who is an alumna of Macquarie University and a friend of Kevin Walker?
[{'who': 'Alice%20Miller'}]
In how many courses is Kevin Walker enrolled?
[{'count': '2'}]
6. Producing a Video
Produce a 3-minute video (“student.mp4”) that presents your implementation. In this video, you should walk the viewer through the code of your Python program and your RML mapping rules and explain the details of your implementation in your own words. Focus on those parts of the implementation that are novel and haven’t already been discussed in the practical tasks of Weeks 7 and 8. You can use the free screen recorder FlashBack Express or Zoom to produce your video.
7. Assessment
This assignment is worth 20 marks in total. You will be awarded marks for the correctness and quality of your code and the content of your video according to the following criteria:
Criteria | Marks | Explanation
Code Quality | 3 | Your code follows a consistent style, is easy to understand, and has been well-documented.
Information Extraction | 5 | All the textual information is extracted and represented exactly as illustrated in the above-mentioned pandas DataFrame (0.5 mark for each of the 10 original sentences).
RML Rules | 4 | The information in the pandas DataFrame is translated into suitable csv files of your choice (1 mark); RML mapping rules then take these csv files as data sources and produce N-Triples as output (in the file “triples.nt”) as illustrated above (3 marks).
N-Triples in Turtle Notation | 1 | Using Python’s rdflib library, the N-Triples are read from the file “triples.nt” and displayed in Turtle notation.
SPARQL Queries | 3 | The six SPARQL queries are answered over the knowledge graph, and the correct answers to these queries are displayed in JSON notation as illustrated above.
Video | 4 | The video explains the Python code (2 marks) and the RML mapping rules (2 marks) in detail, with a particular focus on the new elements of your solution.
8. Submission
You have to submit the original Zip folder (“comp3220-assignment-3.zip”) that now also contains your Python code “student.py”, the modified file “mapping.ttl” with the RML mapping rules, the csv files that serve as data sources to the RML mapping rules, the file “triples.nt” that contains the generated N-Triples, and your video “student.mp4”. Please do not remove the folder “rdfizer” from the original zip folder!
Note that you have to submit a Python program (“student.py”) as part of this assignment and not a Jupyter notebook. Please submit this Zip folder via iLearn before Friday, 27th May 2022, 17:00.
Please contact Rolf Schwitter if you have any questions about this assignment.
Last modified: Friday, 29 April 2022, 11:00 PM