Detailed Program of the Graduate Student Symposium

This is the detailed program for the Graduate Student Symposium.

Understanding and Detecting Traces of Cyberbullying in Online Messages (172)

Marc-André Larochelle

Bullying is an old phenomenon that has entered the digital space of online connectivity under the name of cyberbullying. Although the former has been known for quite a while, defining cyberbullying has been an ongoing challenge and has led to many heterogeneous datasets. In order to better understand the issue, we will consider the cross-domain generalisation of deep learning models between different definitions of disruptive behaviors, such as hate speech, obscenity and cyberaggression. We then further investigate the similarity between the datasets using similarity metrics and dimensionality reduction methods, such as t-distributed stochastic neighbor embedding. These insights will provide additional information as to what defines traces of cyberbullying and how it relates to other disruptive behaviors. Finally, we will evaluate different learning mechanisms and model variants, and whether the use of sentiment analogies can help identify traces of cyberbullying.

Phonetic Normalization and Toxicity Detection within Subversive Messages in Online Speech (173)

Charles Poitras

As online communities and electronic means of communication become more prevalent, moderation and the protection of vulnerable communities become critical. While methods such as blacklists and whitelists can be used to filter content, they can be subverted by toxic users. One common technique is to replace a complete word with a phonetically similar word, or with a combination of letters that resembles the intended word. This paper aims to improve the detection of toxic behaviors by predicting how a particular word might sound phonetically, inferring the intended word from its pronunciation, and using this information to classify potentially toxic messages with varying amounts of subversive content.
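The idea of inferring an intended word from how a disguised spelling sounds can be illustrated with a small sketch. This is not the method of the paper, which predicts pronunciations; it is a hand-written approximation under stated assumptions. The look-alike table, the rewrite rules, the watch list and the function names phonetic_key and infer_intended are all hypothetical.

```python
import re

# Hypothetical table of characters often used as visual stand-ins for letters.
LOOKALIKES = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)

# Hypothetical spelling rewrites that preserve the sound of a word.
REWRITES = [("ph", "f"), ("ck", "k")]


def phonetic_key(word):
    """Map a word to a crude sound-alike key (an assumption, not a real phonetic model)."""
    w = word.lower().translate(LOOKALIKES)
    w = re.sub(r"[^a-z]", "", w)          # drop anything that is still not a letter
    for src, dst in REWRITES:
        w = w.replace(src, dst)
    w = re.sub(r"(.)\1+", r"\1", w)       # collapse repeated letters: "oo" -> "o"
    if not w:
        return ""
    # Keep the first letter, then only the consonants that follow it.
    return w[0] + re.sub(r"[aeiouy]", "", w[1:])


# Hypothetical watch list; a real system would use a curated lexicon.
WATCH_LIST = ["stupid", "idiot"]
KEY_TO_WORD = {phonetic_key(w): w for w in WATCH_LIST}


def infer_intended(token):
    """Return the watch-list word that shares the token's key, or None."""
    return KEY_TO_WORD.get(phonetic_key(token))


for token in ["st00pid", "1d10t", "hello"]:
    print(token, "->", infer_intended(token))
```

A key this coarse will also merge unrelated words, so in practice the inferred word would be one input to a classifier rather than a verdict on its own.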

Personalized Student Attribute Inference (174)

Khalid Moustapha Askia

Accurately predicting students’ future performance can help ensure their successful graduation and help them save both time and money. However, achieving such predictions faces two challenges: the diversity of students’ backgrounds and the necessity of continuously tracking their evolving progress. The goal of this work is to create a system able to automatically detect students in difficulty, for instance by predicting whether they are likely to fail a course. We compare a naive approach widely used in the literature, which uses attributes available in the dataset (such as grades), with a personalized approach we call Personalized Student Attribute Inference (PSAI). With our model, we create personalized attributes to capture the specific background of each student. Both approaches are compared using machine learning algorithms such as decision trees, support vector machines and neural networks.

Towards Analyzing the Sentiments in the Fields of Automobile and Real-Estates with Specific Focus on Arabic Language Text (177)

Ayman Yafoz

There is a lack of contributions addressing sentiment analysis of online automobile and real-estate reviews in the Arabic language, particularly in the Gulf Cooperation Council (GCC) countries. Moreover, in the Arabic language domain, there is a lack of available annotated datasets covering specific domains (such as real estate and automobiles). Furthermore, natural language processing and machine learning techniques have seen only limited and inadequate adoption in current Arabic sentiment analysis contributions. If such datasets are prepared and these techniques are adopted, customized and enhanced, they could add a new scope to the Arabic sentiment analysis field and improve the quality of Arabic sentiment analyzers. These factors motivated us to conduct this research in order to fill the current gap in this area.

Self-Implanted Classifiers for Fine-tuning Deep Neural Network (178)

Farshid Varno, Lisa Di Jorio and Stan Matwin

Despite showing great capacity, deep neural networks still have a challenging and time-consuming training process. However, immediate good and generalized performance is vital to many real-world applications. A well-known way of speeding up this process in neural networks is to fine-tune a pre-trained model. This usually requires the model to be altered based on the definition of the target task. Such modifications usually involve appending a set of parameters to the model. The appended parameters are often initialized randomly; we argue that doing this carelessly can slow down training and notably delay convergence through a phenomenon known as catastrophic forgetting. We address this problem with a novel class of classifiers which we call Self-Implanted Classifiers (SIC).

Deep Learning to Reconstruct Particle Interaction Properties for the SNO+ Neutrino Physics Experiment (182)

Mark Anderson

Neutrinos are tiny particles with no electric charge. Although neutrinos are the most abundant particle in the Universe, they typically pass through matter without interacting, making them difficult to study. Despite their size, these elusive particles may provide answers to fundamental questions in physics, including the reason behind the dominance of matter in the Universe that allows us to exist today. The SNO+ particle astrophysics experiment aims to advance our understanding of neutrinos by searching for certain rare event interactions. However, these rare signals are dominated by background processes and detector noise, leading to the collection of a massive amount of data. At the same time, traditional data processing and analysis approaches are time consuming, resource intensive, and difficult to parallelize. Machine learning presents an appealing alternative to many tasks due to quick inference that takes advantage of modern hardware and parallel computing capabilities. Despite its proven successes in many other fields, machine learning is underutilized in particle astrophysics and on SNO+. In this work, we present a general deep learning framework for handling the large, sparse dataset from the SNO+ experiment. We then apply this framework to reconstruct the position and direction of events in the detector – a necessary task to distinguish signal from background. Our approach improves the reliability of predictions while being up to 10,000 times faster than existing event reconstruction algorithms.

Vehicle Traffic Estimation using Weather Data and Calendar Data (187)

Meetkumar Patel and Daniel Silver

Vehicular traffic is an important concern for commuters. However, there is no application available that can give them information about traffic prior to their journey. Thus, the problem is to define and develop an application that can predict vehicular traffic density and flow rate based on weather data and calendar data. This information might be valuable to the Transportation and Infrastructure department for better planning road maintenance and handling traffic, and to commuters for planning their journeys. The proposed research will combine image processing and machine learning methods.

Predicting Aggressive Responsive Behaviour Among People Suffering from Dementia (190)

Maryam Tajeddin

Patients with dementia can have difficulty properly communicating life’s challenges and may instead become agitated, resulting in verbal or physical aggression. Monitoring the risk of a resident harming themselves or others due to aggressive behavior is a priority within a long-term care facility where dementia is present. Caregivers at long-term care facilities record resident health and behaviour digitally, either as structured data or unstructured text, providing an ongoing log of each resident’s patient history. We aim to use natural language processing (NLP) and machine learning (ML) techniques to develop models that can predict the probability of a resident exhibiting aggressive behaviours that may harm themselves or others within the next week.

Proposed Direct and Indirect Citation Weighting Methods Based on Citation Context Similarity (168)

Toluwase Asubiaro

The most common metric for evaluating the influence of scientific publications is the citation count. Citation count and citation count-based metrics such as impact factor and h-index are overly limiting because they do not consider citation weights: the composite measure of the relative contribution of cited works to the citing document. Additionally, citation count does not consider the influence of a scientific paper beyond the direct citations it has received. Hypothetically, if paper B has cited paper A, and paper C has cited paper B, but paper C has not cited paper A, citation count can neither communicate the probable influence of paper A on C nor weigh the influence of A in B. In this case, paper A receives a direct citation from paper B, while it receives an indirect citation from paper C. This PhD thesis proposes methods for weighting direct and indirect citations which are based on semantic citation context similarity. The direct citation weighting is based on the uniqueness of in-text citation contexts, where unique in-text citation contexts attract more weight. The indirect citations are weighted based on the knowledge flow between papers A and C, that is, the semantic similarity between the citation context of paper B in paper A and the citation context of paper C in paper B, where the level of knowledge flow depends on the semantic similarity. Biomedical publications will be used, and semantic similarity is calculated as cosine similarity over the fastText-based BioSentVec embedding models. The proposed methods have the potential to be useful in determining the research impact of articles, authors and institutions. They can also be useful in sorting documents retrieved from information retrieval systems.
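The cosine similarity named above can be sketched in a few lines. The sketch assumes the two citation contexts have already been turned into fixed-length embedding vectors; the four-dimensional vectors and the names context_ab and context_bc are made up for illustration and do not come from BioSentVec.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero embeddings
    return dot / (norm_u * norm_v)


# Hypothetical embeddings of the two citation contexts linking A, B and C.
context_ab = [0.2, 0.1, 0.0, 0.7]
context_bc = [0.1, 0.1, 0.1, 0.6]

# Under the abstract's scheme, a higher similarity would mean more knowledge flow.
indirect_weight = cosine_similarity(context_ab, context_bc)
print(round(indirect_weight, 3))
```

How the similarity is turned into a final citation weight is not specified in the abstract, so the sketch stops at the similarity score.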

Development of a taxonomy to detect online deviant behaviors (170)

Zeineb Trabelsi

Over the last decades, the rapid growth of information and communication technology (ICT) has enabled new forms of communication and content sharing, but unfortunately, it has also been associated with a visible increase in deviant behaviors in online communities. Without standardized definitions of the forms of deviant behavior, detecting and classifying these behaviors is a challenging task for moderators and automated systems. Moreover, the lack of consensus makes it difficult to identify techniques that contribute to effectiveness across moderation systems. Having a consensus on what constitutes each type of deviant behavior can help moderators implement and enforce community rules, help users understand these rules, and create a possibility for collaboration between the moderation teams of different communities. Drawing on interdisciplinary literature, this study aims to develop and validate a taxonomy of online deviant behaviors in order to help moderators reinforce their norms. Toward this end, we conduct two studies. First, we classify active behaviors on social networking sites into four categories using the literature. Then, we test the reliability of the developed taxonomy by using inter-rater agreement among three annotators on data collected from public datasets.
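The abstract does not say which agreement statistic is used. One common choice for more than two annotators is Fleiss' kappa, sketched below as an assumption. The function name fleiss_kappa and the example counts are hypothetical; each row is one message and each column one of four taxonomy categories, holding the number of annotators (out of three) who chose that category.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of per-item category counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement for each item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        return 1.0  # every rating fell in one category; agreement is trivially perfect
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical counts: 4 messages, 4 categories, 3 annotators each.
example_counts = [
    [3, 0, 0, 0],
    [2, 1, 0, 0],
    [0, 0, 3, 0],
    [1, 1, 1, 0],
]
print(round(fleiss_kappa(example_counts), 3))
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance.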

Federated Learning for Big Healthcare Data: Motivations and Challenges (181)

Samaneh Miri Rostami

Applications of Artificial Intelligence (AI) and Machine Learning (ML) techniques to healthcare data are increasingly revolutionizing various aspects of the field. An accessible data aggregation environment is key to the success of these data modeling and analytics methods. However, in the healthcare domain, datasets are usually isolated and are too sensitive to be aggregated due to privacy acts and concerns. On the other hand, a relatively new family of learning methods, known as Federated Learning (FL), gives the opportunity to apply AI techniques in a decentralized fashion on the host data. Therefore, Federated Learning can be used as a new framework in the healthcare sector by allowing collaboration without sharing data. In FL, many constraints often co-occur and make the problem a multi-dimensional task that includes machine learning, distributed optimization, security, privacy, information theory, statistics, compressed sensing, and more. A practical FL system demands the best possible trade-off between efficiency, privacy, and accuracy, which is still an ongoing area of research and needs researchers’ attention.

Estimating Delta-V from Car Crash Images (183)

Hardik Manek and Daniel Silver

Insurtech is a combination of the words “insurance” and “technology,” inspired by the term fintech. It refers to the use of technology innovations designed to squeeze out savings and efficiency from the current insurance industry model. Many investors and financial analysts feel that the insurance industry is ripe for technological and business practice innovation and disruption. The key areas of impact on automobile insurance are felt to be data collection and analysis for better policy pricing; telematics onboard vehicles that can be used to adjust premiums in near-real-time and therefore deal with “moral hazards”; and artificial intelligence in claims management for determining things such as payout and fraud. This research aims to streamline the process of claim handling. Delta-V is the change in velocity of the car, that is, the velocity of the car before the accident minus its velocity after the accident. With the help of Delta-V, the severity of the crash can be predicted, and eventually the type of injury suffered by the occupants can also be predicted. Hence the proposed research uses Machine Learning to estimate Delta-V from accident images. Images are extracted from the CIREN dataset provided by the National Highway Traffic Safety Administration (NHTSA).
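The definition of Delta-V above can be made concrete with a short sketch. It assumes velocities are given as two-dimensional (vx, vy) pairs in km/h and reports the magnitude of the change; for motion along a single axis this reduces to the simple subtraction in the abstract. The function name delta_v and the example values are hypothetical.

```python
import math


def delta_v(v_before, v_after):
    """Magnitude of the change in velocity between two (vx, vy) pairs."""
    dvx = v_after[0] - v_before[0]
    dvy = v_after[1] - v_before[1]
    return math.hypot(dvx, dvy)


# Hypothetical head-on case: the car slows from 50 km/h to 10 km/h on one axis.
print(delta_v((50.0, 0.0), (10.0, 0.0)))

# Hypothetical side impact: the car is also deflected sideways.
print(delta_v((30.0, 0.0), (0.0, 40.0)))
```

In the proposed research this quantity is the target that a model would estimate from images, rather than something computed from measured velocities.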

Predicting Journey of The Donor Using Deep Learning Models (184)

Ajith Kumar Veera Raghavan

In this paper, I describe how to predict the behavior of donors using machine learning models to help charities receive more gifts. Charities send emails, including solicitation and stewardship messages, to raise money. I show that Recurrent Neural Networks (RNNs) can learn which action to take next and which email parameters to use. After learning, the model can suggest actions that should lead to larger donations. As deep learning models like RNNs are widely used to deal with sequential data, my research makes use of these algorithms to learn which action sequences lead to bigger donations.

Dynamic conversational agent with multiple knowledge bases (185)

Xavier Lindsay and Khadija Essaied

This project aims to create an intelligent conversation-based chat agent to provide text-based customer support in the insurance domain. Given a user with a nebulous idea of what they want, the chat agent must iteratively refine the answers it provides as the user's question changes in reaction to the questions the agent has already addressed. The target implementation is a dynamic domain question-answering system with multiple knowledge sources. Each knowledge source should pertain to a specific subdomain within the insurance domain, namely: the general knowledge base, the domain-specific knowledge base and the user-specific knowledge base.

Deep learning-based deconvolution of bulk gene expression profiles to characterize tumor immune landscape of early onset breast cancer (186)

Yong Won Jin

Young age at diagnosis is considered to be an independent factor for higher recurrence and poor clinical outcomes in breast cancer patients. Patterns of tumor infiltrating lymphocytes may provide insight into the underlying biology behind this disparity, which is yet to be discovered. Deconvolution methods can be used to characterize tumor infiltrating lymphocytes given the gene expression profile of the bulk tumor tissue sample. However, existing statistical and traditional machine learning-based methods rely on prior knowledge and a number of assumptions that limit the interpretability and robustness of their estimates. This is shown by several inconsistencies in estimates between the methods, especially with regards to cell types of the innate immune system. The wealth of publicly available gene expression datasets and resources allows for the proposition of a new model of deconvolution using deep neural networks (DNN). This new DNN model will be used to extract patterns of tumor infiltrating lymphocytes distinct in early onset breast cancer that are significantly associated with clinical outcomes and other molecular signatures.

Predicting Pedestrian Traffic Flow and Density (191)

Riddhi Joshi and Daniel Silver

Pedestrian traffic flow rate and density prediction is a useful and challenging area of research in the field of machine learning. Pedestrian traffic information offers useful insights when building or developing a business; and currently, there is no application available that can estimate pedestrian traffic flow into the future. Thus, the problem is to design and implement a system to predict pedestrian traffic flow rate and density for up to five days into the future, based on weather data and calendar events using machine learning.

Sales forecasting using Long Short Term Memory Networks (192)

Yuanyuan Shi

The Annapolis Cider Company of Wolfville, Nova Scotia would like to estimate their future hourly sales data based on the day and time, weather conditions, and knowledge of holidays and special events in Annapolis Valley. We intend to use machine learning methods to predict hourly sales for up to 5 days in advance.

Feature Selection for Charitable Organizations (193)

Syed Safwan

The goal of this project is to develop models that select and create useful features of charitable constituents using machine learning. The domain is the charitable industry, and we aim to help charities raise more money.

Machine Learning in Terpenes Classification (195)

Deepkumar Shah, Daniel Silver and Andrew McIntyre

Separation and classification of terpenes from food samples is done with the help of two processes: Gas Chromatography (GC), which separates the terpenes from each other, and Mass Spectrometry (MS). Together, these methods separate and identify compounds from a given sample. A mass spectrum generated by mass spectrometry is matched by a chemist against an existing database of mass spectra of various samples. Artificial Neural Networks (ANNs), Support Vector Machines (SVMs) and Deep Learning, in particular, have been widely adopted for many modern supervised learning tasks (e.g., classification and regression problems) and are known for their state-of-the-art generalization performance. The proposed research uses ANNs as part of a tool that can read mass spectra and automatically detect the presence of various terpenes and determine their relative proportions. Early classification results using the SVM algorithm return an accuracy of 91.43% for four terpenes.