ai for learning

Carnegie Mellon University | 2021

This was the final project for the class "Designing Smart Systems". Each group had to come up with a unique application for AI or Data Science. 

Our goal with this project was to explore building a model that augments a student’s ability to process information and prepare for exams by providing distilled, personalized (based on learning/note-taking style), and complete learning notes (at various subordination levels) with relevant examples. Assessments in education broadly fall under two categories: formative and summative (Source: E-Learning Industry). For the purposes of this report, we focus exclusively on supporting students with summative assessments.

Team: Yash Banka, Trishala Pillai, Tanvi Bihani



  • Artificial Intelligence

  • Data Science

  • Machine Learning

methods used

  • Ethnographic Research

  • Secondary Research (articles, reports)


tools used

  • Weka

  • Microsoft Excel

  • Figma


Traditional pedagogy in education is set up in a way where good grades and a high GPA (among other conventional standards for success) become the key indicators of intelligence and competence. As a result, students the world over review large textbooks, memorize (rather than understand or apply) countless concepts, and practice hundreds of questions to succeed in exams. They face limited time and heightened pressure to make sense of all the new information. Along the way, their mental well-being can be compromised.

Given the nature of our education system, many students look for alternative, supplemental learning sources, increasingly powered by technology (including, but not limited to, other learning materials, past exam papers, question banks, problem-solution e-learning sites, after-school tutoring, coaches, family members, teacher office hours or their peers).

  • Student demand, combined with key stakeholders' awareness of “learning to learn” (Source: New York Times) and of preparing students for the future of work, has given rise to the educational technology industry.

  • Per a 2021 industry report, “US EdTech raised $2.2 billion amid a pandemic in 2020, which is a 30% increase from investments in 2019” (Source: EdSurge).

  • A closer look shows that the majority of investments were made in external supplemental or application-based learning experiences (like Course Hero, Skillshare, MasterClass and Udacity).

  • This insight is fairly consistent globally. In India, edtech is “expected to grow to $1.96 billion USD with a 52% compound annual growth rate from $0.25 billion to $1.96 billion between 2016 and 2021” (Source: MoEngage).


secondary research

  • A study by ACHA in 2018 found that 57.8% of college students surveyed in the US self-evaluated their stress level as more than average or tremendous (Source: ACHA). In a parallel study from the United Kingdom, “61% of the students cited their source of pressure as getting good grades.”

  • Though note-taking is a high-stakes determinant of student success in examinations, “students are incomplete note-takers who routinely record just one-third of a lesson’s important concepts in their notes due to technical difficulties, information being presented too quickly (120 to 180 words per minute while they can capture 20 words per minute on average), fatigue and digital distractions” (Source: Open Text BC).

  • Universally, the types of note-taking methods can broadly fit into versions of four categories: lists, outlines, concept maps and the Cornell method (Source: Open Text BC).

Concept Map

Outline Method

List Method

Cornell Method




Students

  • Our model helps students reduce/manage the challenges associated with preparing for summative examinations (especially at the last minute).

  • Our application provides a complete, comprehensive yet distilled and organized set of notes in one place, enabling them to connect the dots between concepts in various sources more easily.

  • We aim to support students who are overwhelmed/stressed and reduce their chances of resorting to plagiarism/cheating.

  • Our application intends to provide students with notes that can encourage a healthy, collaborative, peer-to-peer learning environment that is conducive to student growth (with notes that are shareable and can be discussed through prompts).

  • With the use of data visualization, students are able to improve their retention of information, learning efficiency and application of concepts covered.


Parents

  • Parents of students will be able to spend less money, effort and time on tutors, external support and/or supplemental materials to ensure their child is set up for success.

  • They feel assured and comforted about their child’s overall health and wellbeing by seeing that their child feels supported and enabled with a tool powered by our model.

  • They are able to help their child more easily by reviewing the distilled notes themselves and being more involved if they need support.

Educators in Academia

  • With our application, students can stay more engaged in the classroom, knowing that the notes will be taken care of (though they can personalize their learning further with additional notes they take in class). As a result, teachers can cater to students engaged in classroom learning rather than students multitasking with note-taking (among other distractions). Our application can also augment a teacher’s ability to educate: the notes our model produces from their class content can help students revise, and teachers can use those same notes as feedback on the coherence of their lectures.

Educational Institutions

  • Our application can positively influence institutions by helping them report stronger metrics tied to indicators of student success (e.g. higher average scores, lower dropout rates and/or higher graduation rates), improving their institutional rankings and overall reputation.

  • Happy, healthy and engaged students who are fulfilled with their academic experience and feel supported are more likely to spread the word to other students to drive up student recruitment and retention.


Currently, students use applications like Google Docs, OneNote, Evernote and Notion for note-taking. This requires the student to be a quick transcriber and typist, and can also take their attention away from the topics being discussed. AI tools like Otter help transcribe recorded lectures but are not always accurate (requiring significant review and editing time) and do not help with organizing the notes to drive outcomes for students.

To narrow our scope significantly, we built a model that consolidates test papers and analyzes them to identify and organize questioning patterns (format, weightage, etc.). It then matches these patterns to concepts from relevant learning resources, summarizing and prioritizing them to create an organized set of notes.

The computer performs the following tasks:

  • Data input, extraction and scanning

  • Identifying repeating questions and patterns, formats and weightage

  • Matching the questions with the correct answers

  • Identifying underlying concepts behind the answers

  • Finding commonalities between concepts

  • Grouping concepts into chapters

  • Synthesizing it into a set of visualized notes
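The tasks above can be sketched end-to-end in code. This is a minimal illustration, not our actual model: the question records, keywords and the keyword-to-chapter lookup below are hypothetical stand-ins for the dataset described in this report.

```python
from collections import Counter, defaultdict

# Hypothetical question records mirroring the attributes of our dataset.
questions = [
    {"text": "Name the structural and functional unit of the nervous system.",
     "keywords": ["neuron", "nervous system"], "weightage": 1, "year": 2019},
    {"text": "Define reflex arc.",
     "keywords": ["reflex arc", "nervous system"], "weightage": 2, "year": 2020},
    {"text": "State the function of the pulmonary artery.",
     "keywords": ["pulmonary artery", "circulatory system"], "weightage": 2, "year": 2020},
]

def keyword_frequencies(records):
    """Count keyword occurrences across papers to surface repeating patterns."""
    return Counter(kw for r in records for kw in r["keywords"])

def group_by_chapter(records, keyword_to_chapter):
    """Group questions into chapters via a keyword -> chapter lookup."""
    chapters = defaultdict(list)
    for r in records:
        for kw in r["keywords"]:
            if kw in keyword_to_chapter:
                chapters[keyword_to_chapter[kw]].append(r["text"])
                break
    return dict(chapters)

# Hypothetical concept lookup; in the real model this mapping would itself
# be learned from the labelled dataset rather than hand-written.
keyword_to_chapter = {
    "neuron": "Nervous System",
    "reflex arc": "Nervous System",
    "pulmonary artery": "Circulatory System",
}
notes_outline = group_by_chapter(questions, keyword_to_chapter)
```

The resulting chapter-to-questions outline is the skeleton that the later steps (summarization and visualization) would flesh out into notes.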

The user inputs information like grade, educational board, subject, chapter and note-taking style/preference for personalization. If we were building this model for the real world with more resources, we would tap into a range of data sources (e.g. textbooks, lecture notes, videos), but for the purposes of this report we stuck to past examination papers for a specific board and subject, in the form of a structured, labelled dataset that we created ourselves from scratch.


We selected a focus area to experiment with our model:

  • Educational Board - ICSE (India)

  • Subject: Biology

  • Grade: 10

  • Papers: Board Exam past papers and official sample papers

  • Topics: Nervous system, Circulatory system

  • Attributes in our data set:

Educational board, Subject, Paper name, Paper type, Year, Question, Question type, Question no., Sub question no 1, Sub question no 2, Weightage, Chapter, Chapter section, Keyword 1, Keyword 2, Keyword 3, Keyword 4, Keyword 5

As the first step, we identified the questions from the papers and created a draft dataset. This was then used to create a new dataset with the attributes above, containing 78 rows of data. The keywords were selected from the question statements and, in some cases, from the answers too (e.g. for fill-in-the-blank questions). We have two versions of this dataset, one of which is encoded.
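A minimal sketch of how an encoded version and its encoding key could be produced from the raw rows; the rows and attribute values here are illustrative, not the actual 78-row dataset.

```python
# Illustrative rows only, not the actual dataset.
rows = [
    {"Chapter": "Nervous System", "Question type": "MCQ"},
    {"Chapter": "Circulatory System", "Question type": "Short answer"},
    {"Chapter": "Nervous System", "Question type": "Short answer"},
]

def build_key(rows, attribute):
    """Assign a stable integer code to each distinct value of an attribute."""
    return {v: i for i, v in enumerate(sorted({r[attribute] for r in rows}))}

def encode(rows, attributes):
    """Return the encoded rows plus the encoding key used to produce them."""
    keys = {a: build_key(rows, a) for a in attributes}
    encoded = [{a: keys[a][r[a]] for a in attributes} for r in rows]
    return encoded, keys

encoded, keys = encode(rows, ["Chapter", "Question type"])
```

The `keys` mapping plays the role of the separate "Encoding Key" sheet: it lets anyone translate the encoded version back to the original values.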

We used Weka for exploratory data analysis and carried out logistic classification twice: once for the chapter attribute and once for the chapter section attribute. These preliminary results were promising. While building the dataset we could already see patterns in the questions over the years: patterns in weightages, repetition of questions, question numbers and so on.



The dataset we built

Encoded Version

Encoding Key


Classification in Weka based on chapter

Classification in Weka based on sub-section


Representation of keyword clustering based on chapter sub-section



  • Limited availability of data - while past papers are easily available online, other data sources (like lecture recordings/notes for specific classes) that can help us build a smarter system might not be as accessible. There are also legal and privacy restrictions associated with getting hold of this data.

  • Handling inaccurate data - We would need to account for the influence of time, context and relevance of learning on our model. 

  • Incomplete & inconsistent data - Two of the challenges we would need to tackle are managing missing fields/records and managing inconsistencies caused by different data formats used in different sources, with attributes/fields meaning different things in different sources (e.g. the use of specific keywords in class versus in a textbook or exam paper might be rooted in different contexts).

  • Algorithmic bias & managing bias in our data set:  We would want to ensure that we collect, clean and structure data on more education boards and subjects at more levels of education (among other attributes like country, language, etc) to make the model (and end application) inclusive. 

  • Overfitting our model - Building an algorithm that memorizes data instead of learning meaningful dependencies.

  • Ethical implications of the use and adoption of our model - The process of note-taking is instrumental in processing and retaining information (research consistently substantiates this). How do we build our model in a way that ensures students can still reap the benefits of note-taking (perhaps even be encouraged to do it, by applying our understanding of behavioral economics and principles of nudge theory)? Furthermore, as AI and data science practitioners, it is our moral obligation and responsibility to encourage students to go beyond learning for grades/testing purposes and to cultivate lifelong learners.


The limitations above are opportunities for us to improve our model. The scope of our model for this report has ethical implications, but we believe the larger problem of comprehending vast amounts of information from various sources and producing an “output” that is digestible, easy to process at various levels of abstraction (ranging from what you need to know to what you want to know), and adapted to learner needs/styles is an important area in which to further apply AI and data science. Detecting patterns, reasoning to answer questions, and drawing new conclusions are innately human abilities that can be augmented when humans and computers work together.

Our current model uses supervised learning methods and structured data (given the scope of our exploratory research, dataset, and preliminary model building). Our vision is to move from narrow AI using explicit knowledge (programmed with explicit rules) to narrow AI using tacit knowledge, and eventually to deep learning (with the ability to generalize concepts from examples, given sufficient data and computing power, with no human interaction).

In the future, we hope to use a combination of unstructured and structured data, using unsupervised learning methods (e.g. identifying interesting patterns/regularities in the data through k-means clustering, association rule discovery, principal components analysis).
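As a sketch of one of these unsupervised methods, here is a bare-bones k-means implementation. The two-dimensional toy points stand in for the feature vectors (e.g. keyword co-occurrence vectors) our data would actually produce; a production system would use a library implementation with proper seeding and restarts.

```python
def kmeans(points, k, iters=20):
    """Plain k-means: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its assigned points."""
    # Naive initialization from the first k points; real implementations
    # use random restarts or k-means++ seeding.
    centroids = [points[i] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Two well-separated toy groups; stand-ins for real feature vectors.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, 2)
```

On this toy data the algorithm recovers the two groups, with centroids near (1.33, 1.33) and (8.33, 8.33).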

Examples of unstructured data include videos, images and speech, and we could utilize AI technologies such as machine vision/image recognition (emotion/movement recognition) and NLP (speech to text, text to speech, information extraction) to fuel our model. For example, a professor might point at the board or raise their voice while emphasizing an important point; we would want to analyze this unstructured data to increase our model's accuracy and our application's effectiveness.

Our model’s ability to solve our problem depends on the strength of our dataset (size, breadth of attributes, quality, completeness) and the modeling technique we use to make sense of it. We were happy with what our model was able to achieve, and we see promise in the potential of a model like this, in a different context, to help current and future learners (and the individuals who support them).

