Program Example: B452F Assignment 1

BIA B452F Assignment 1

Weighting: 25%

Deadlines:
Part A (10%) – 26 March 2021 (Friday)
Part B (15%) – 9 April 2021 (Friday)

Learning outcomes:
Explain and select analytic techniques for business intelligence and big data analysis.
Apply data visualization tools and predictive analytics to summarize and analyze business data.

Important note:
There might not be a single correct answer to the questions. Your answers may differ from those of other students and could all be equally valid.

This is an individual assignment. Copying some or all of another student’s assignment is plagiarism. Discussing your assignments with other students and seeking their comments and advice is acceptable, but it is not acceptable for two students to hand in assignments that are substantially the same. When you collaborate on an individual assignment, it is important that the final product is your own work.

Investigating Online Shoppers’ Purchasing Intention

In this assignment, you will perform exploratory and clustering analyses to investigate online shoppers’ purchasing intention based on clickstream data obtained from the navigation paths of online shoppers. The numerical and categorical features used in studying online shoppers’ purchasing intention are given in the tables below. The sample dataset “online_shoppers.csv” consists of feature vectors belonging to 12,330 sessions.
Table 1 – Numerical features

Feature name | Feature description
Administrative | Number of pages visited by the visitor about account management
Administrative duration | Total amount of time (in seconds) spent by the visitor on account management related pages
Informational | Number of pages visited by the visitor about the Web site, communication and address information of the shopping site
Informational duration | Total amount of time (in seconds) spent by the visitor on informational pages
Product related | Number of product related pages visited by the visitor
Product related duration | Total amount of time (in seconds) spent by the visitor on product related pages
Bounce rate | Average bounce rate value of the pages visited by the visitor. The “Bounce Rate” feature for a Web page refers to the percentage of visitors who enter the site from that page and then leave (“bounce”) without triggering any other requests to the analytics server during that session.
Exit rate | Average exit rate value of the pages visited by the visitor. The “Exit Rate” feature for a specific Web page is calculated as, for all pageviews to the page, the percentage that were the last in the session.
Page value | Average page value of the pages visited by the visitor. The “Page Value” feature represents the average value for a Web page that a user visited before completing an e-commerce transaction.
Special day | Closeness of the site visiting time to a special day

Table 2 – Categorical features

Feature name | Feature description
Month | Month of the visit date
OperatingSystems | Operating system of the visitor
Browser | Browser of the visitor
Region | Geographic region from which the session has been started by the visitor
TrafficType | Traffic source by which the visitor has arrived at the Web site (e.g., banner, SMS, direct)
VisitorType | Visitor type, coded as 0 – Returning Visitor, 1 – New Visitor, and -1 – Other
Weekend | Boolean value indicating whether the date of the visit is a weekend
Revenue | Boolean value indicating whether the visit has been finalized with a transaction

Part A – Exploratory Analysis (40 marks)

In this task, you have to apply exploratory analysis to reveal online shoppers’ purchasing intention, which could be used for formulating customized promotions for online shoppers. You have to define your own research questions and use summary statistics and data visualization to perform initial investigations on the data. For example, you can use correlation analysis to identify the factors (i.e., variables) that help predict a visitor’s purchasing intention and likelihood of abandoning the site, and then define and use appropriate criteria to generate graphic representations showing the impact of these factors. You are also required to clearly explain your observations.

You have to pre-process the data before constructing data visualizations. For example, you have to handle missing data, code the variables, and perform data aggregation. You may use different approaches to handle missing data and make any reasonable assumptions in the analysis, if necessary. You may also use any appropriate visualization methods in your analysis. However, you have to justify your methods and any assumptions made.

Part B – Clustering Analysis (60 marks)

(a) In this task, you will use unsupervised learning to segment the sample dataset.
You have to apply the K-means and Expectation Maximization algorithms to cluster the online shoppers’ purchasing intention dataset and interpret the cluster results. You must decide which features to include in the clustering. The main purpose of this analysis is to help the business better understand how to use the clustering results to predict the behavior of online shoppers in real time and take action accordingly to improve the shopping cart abandonment and purchase conversion rates. Specifically, you have to perform the following tasks:

Load and prepare the data (e.g. data cleansing and data normalization).
Train a K-means model on the data and select k based on the scree plot and silhouette plot. Rerun the model with the optimal number of clusters.
Apply the Expectation Maximization (EM) algorithm to cluster the data.
Compare the results of the two methods based on the silhouette plot and Dunn index and select the best clusters.
Perform exploratory analysis on the clusters (e.g. descriptive statistics, 2D and 3D scatterplots, histograms, correlation analysis, etc.) and interpret the clustering results.

(Note: you can only use numerical variables for K-means and EM.) (50 marks)

(b) Explain why K-means can only use numerical variables for clustering and discuss how to cluster mixed data types (i.e., both numerical and categorical variables) in R. (Note: you don’t need to write the R program.) (10 marks)

Grading Criteria

Each submission will be graded based on both the analysis process and the included visualizations. Here are our grading criteria:

Appropriate data cleansing and transformation.
Sufficient breadth of analysis, exploring multiple questions.
Sufficient depth of analysis, with appropriate follow-up questions.
Expressive and effective visualizations crafted to investigate analysis questions.
Clearly written, understandable captions that communicate primary insights.

Submission Details

Your completed work should be uploaded to OLE before the deadlines as follows:

1. Part A – Exploratory Analysis (Mar 19, Friday)
   Analysis report – “Assignment 1 (Part A)”
   R program (or R markdown) – “Assignment 1 (Part A) – R program”
2. Part B – Clustering Analysis (Apr 9, Friday)
   Analysis report – “Assignment 1 (Part B)”
   R program (or R markdown) – “Assignment 1 (Part B) – R program”

Marks will be deducted for any non-compliance with the submission requirements.
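
Illustrative sketch for Part A (pre-processing and correlation analysis)

The pre-processing steps named in Part A (handling missing data, coding variables, and correlating features with Revenue) might look roughly like the following. This is a minimal sketch in Python using only the standard library, shown purely to illustrate the logic; the submission itself must be an R program. The column names PageValues and ExitRates, the VisitorType labels, and all sample rows are assumptions made for illustration and are not taken from the dataset. Median imputation is only one of several defensible ways to handle missing values and would need to be justified in the report.

```python
import math
import statistics

# Hypothetical sample rows (invented values); None marks a missing value.
sessions = [
    {"PageValues": 0.0, "ExitRates": 0.20, "VisitorType": "Returning_Visitor", "Revenue": False},
    {"PageValues": 12.5, "ExitRates": 0.05, "VisitorType": "New_Visitor", "Revenue": True},
    {"PageValues": None, "ExitRates": 0.10, "VisitorType": "Returning_Visitor", "Revenue": False},
    {"PageValues": 30.0, "ExitRates": 0.02, "VisitorType": "Returning_Visitor", "Revenue": True},
    {"PageValues": 0.0, "ExitRates": None, "VisitorType": "Other", "Revenue": False},
    {"PageValues": 8.0, "ExitRates": 0.15, "VisitorType": "New_Visitor", "Revenue": False},
]

# Coding from Table 2: 0 = Returning Visitor, 1 = New Visitor, -1 = Other.
# The string labels on the left are assumed, not confirmed by the brief.
VISITOR_CODES = {"Returning_Visitor": 0, "New_Visitor": 1, "Other": -1}


def impute_median(rows, column):
    """Return copies of the rows with None in `column` replaced by the median."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = statistics.median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Step 1: handle missing data.
cleaned = sessions
for column in ("PageValues", "ExitRates"):
    cleaned = impute_median(cleaned, column)

# Step 2: code the categorical and Boolean variables as numbers.
cleaned = [
    {**r, "VisitorType": VISITOR_CODES[r["VisitorType"]], "Revenue": int(r["Revenue"])}
    for r in cleaned
]

# Step 3: correlate each numerical feature with the 0/1 Revenue flag.
revenue = [r["Revenue"] for r in cleaned]
correlations = {
    c: pearson([r[c] for r in cleaned], revenue) for c in ("PageValues", "ExitRates")
}
for column, value in correlations.items():
    print(f"{column}: r = {value:+.2f}")
```

On real data, the sign and size of each coefficient would suggest which features deserve a follow-up visualization; the coefficients printed here reflect only the invented rows.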
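
Illustrative sketch for Part B(a) (normalization, K-means, and choosing k by silhouette)

The first Part B tasks (normalize the data, run K-means for several values of k, and pick k from the silhouette scores) can be sketched as below. This is a hedged illustration in standard-library Python, not the required R program. The toy points are invented. The farthest-first initialization is a deterministic stand-in chosen so that the example is reproducible; it is not what the brief prescribes, and a scree plot of within-cluster sums of squares is omitted for brevity.

```python
import math


def min_max_normalize(points):
    """Rescale every feature to [0, 1] so no feature dominates the distance."""
    columns = list(zip(*points))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [tuple((v - lo) / s for v, lo, s in zip(p, lows, spans)) for p in points]


def farthest_first_centers(points, k):
    """Deterministic initialization: repeatedly add the point farthest from all centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    centers = farthest_first_centers(points, k)
    labels = []
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the previous center if a cluster is empty
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def mean_silhouette(points, labels):
    """Average silhouette width; singleton clusters contribute 0."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    total = 0.0
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1 or len(clusters) < 2:
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for m, other in clusters.items()
            if m != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical toy data: three tight groups of five points each.
offsets = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
raw = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (10, 0), (5, 9)] for dx, dy in offsets]
points = min_max_normalize(raw)

scores = {k: mean_silhouette(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print({k: round(s, 2) for k, s in scores.items()}, "-> chosen k =", best_k)
```

Normalizing first matters because K-means relies on Euclidean distance: without it, duration features measured in seconds would swamp rate features that lie between 0 and 1.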
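
Illustrative sketch for Part B(a) (Dunn index)

Part B asks for the K-means and EM results to be compared using the Dunn index. As a rough illustration, one common variant divides the smallest distance between points in different clusters by the largest distance between points in the same cluster, so higher values indicate compact, well-separated clusters. The sketch below is standard-library Python for illustration only (the submission must be in R), and the four sample points and both labelings are invented.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Dunn index: minimum inter-cluster distance / maximum cluster diameter."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    groups = list(clusters.values())
    if len(groups) < 2:
        raise ValueError("Dunn index needs at least two clusters")
    # Largest distance between any two points sharing a cluster.
    max_diameter = max(
        (math.dist(p, q) for g in groups for p, q in combinations(g, 2)),
        default=0.0,
    )
    # Smallest distance between any two points in different clusters.
    min_separation = min(
        math.dist(p, q) for g, h in combinations(groups, 2) for p in g for q in h
    )
    if max_diameter == 0.0:
        return math.inf  # every cluster is a single point or exact duplicates
    return min_separation / max_diameter


# Hypothetical points: two pairs that are far apart from each other.
sample = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = dunn_index(sample, [0, 0, 1, 1])  # labels follow the natural grouping
bad = dunn_index(sample, [0, 1, 0, 1])   # labels cut across the grouping
print(good, bad)
```

When comparing two clusterings of the same data, the one with the larger index is preferred under this criterion, although it is sensitive to outliers because it depends on single extreme distances.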