Home page of the course website for the 2025-26 edition
**Read the information at the top of the page, then scroll down to see information about each domain.**
Domain Descriptions
DSC Capstone, 2025-26 @ UC San Diego
Overview
Welcome to the capstone program! The capstone program is a two-quarter sequence (Fall 2025 and Winter 2026) in which you will be mentored by a faculty member or industry expert in their domain of expertise. By the end of Quarter 2, you will design and execute a project from that domain in teams. You can see the projects from last year at dsc-capstone.org/showcase-25.
At a high level, here’s how the capstone program is organized:
- In Quarter 1 (DSC 180A), you gain background information in your mentor’s domain by replicating a known result. By the end of Quarter 1, you will have completed a replication project (known as the “Quarter 1 Project”) and will have a proposal for a more independent project (known as the “Quarter 2 Project”, or the capstone project).
- In Quarter 2 (DSC 180B), you execute the Quarter 2 Project you proposed at the end of Quarter 1.
Enrollment
First pass for enrollment will begin soon. The available domains are not listed on the Schedule of Classes; instead, they are detailed below. Most domains are run by UCSD faculty, but some are run by industry partners (denoted with an Industry Partner badge).
Use the information here to choose the domain you’d like to enroll in. Once you’ve chosen a domain, all you need to do is enroll in the corresponding discussion section for DSC 180A once registration is open, space permitting. Note that you cannot change domains between DSC 180A and DSC 180B.
All of the information here – domain offerings, section times, descriptions, summer tasks, etc. – is subject to change as mentors provide us with more information.
How should I choose a domain?
You should aim to choose a domain that suits your interests and preparation. By clicking the Read more button underneath a domain, you’ll get to learn more about the mentor, their mentoring style, the prerequisites that they’d like their students to have, tasks that they’d like their students to work on over the summer, and their students’ capstone projects in previous years (if any).
✅ Good reasons to choose a domain:
- You are sufficiently prepared.
- You find the topic interesting and would be motivated to study it.
- You are likely to work well with the mentor given their mentoring style and research background.
❌ Bad reasons to choose a domain:
- The mentor has good teaching evaluations.
- The industry partner is a big company.
- The section has space for you and your friends at a time that you like.
Everything you produce for the capstone will have to be public on the internet for the rest of eternity with your and your mentor’s names attached to it – you want your capstone work to be something that you’re proud of and can talk about on job and graduate school applications. Who do you want writing you a recommendation letter?
What happens in DSC 180A?
In addition to meeting with your mentor each week, there will also be methodology instruction delivered by the capstone coordinator and the methodology course staff. However, the majority of this instruction will occur asynchronously, in the form of readings (like this one). This means that you can mostly ignore the lecture and lab times that appear for DSC 180A on the Schedule of Classes. A few of the lecture slots may be used for the capstone coordinator’s office hours or for one-off guest lectures, but we don’t plan to use the majority of the times.
All prerequisites for DSC 180A will be strictly enforced. The prerequisites for DSC 180A can be found here. If you took either DSC 140A, DSC 140B, or DSC 148 to satisfy the machine learning prerequisite, you may need to submit an Enrollment Authorization System request in order to enroll in DSC 180A in fall quarter.
Note that since DSC 180A and DSC 180B are both 4-unit courses, you should expect to spend 12 hours a week on capstone-related work each quarter. Plan your class schedule accordingly – try not to take several time-consuming classes alongside the capstone.
Who is overseeing the capstone?
If you have questions about the capstone sequence itself, feel free to email Umesh Bellur (ubellur@ucsd.edu) for now.
If you have questions about the content of a particular domain, contact the mentor. For questions about enrollment, please contact Student Affairs via the VAC.
NLP Credit Score Development
Brian Duke, Kyle Nero • brian.duke@prismdata.com, kyle.nero@prismdata.com
TA: In-person
A01 12 seats Thursday 1-2p
One of the most widely used and least understood parts of the Financial Services industry is the credit score. In this course, students will work with transactional bank data to build statistical models for the purpose of assessing creditworthiness in the financial services industry. The course will take students through the life of a model development project, from data exploration through model training and evaluation. Students will have the opportunity to work with both structured and unstructured data as they learn about the process and attributes that go into credit scores. Additionally, students will learn about the importance of model explainability and fairness.
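For a feel of the starting point the first weeks build toward, here is a minimal sketch of a transaction categorization baseline using scikit-learn. The transaction strings and category labels are made up for illustration; the actual data and categories used in the domain will differ.

```python
# Minimal sketch: categorize bank-transaction descriptions with TF-IDF + logistic regression.
# The transactions and category labels below are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

transactions = [
    "STARBUCKS STORE 123 SAN DIEGO CA",
    "PAYROLL DEPOSIT ACME CORP",
    "UBER TRIP HELP.UBER.COM",
    "TRADER JOES #552 LA JOLLA CA",
    "PAYROLL DEPOSIT ACME CORP",
    "LYFT RIDE THU 8PM",
]
categories = ["dining", "income", "transport", "groceries", "income", "transport"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),  # character n-grams handle messy merchant strings
    LogisticRegression(max_iter=1000),
)
model.fit(transactions, categories)
print(model.predict(["STARBUCKS 98 UCSD", "DIRECT DEP ACME CORP PAYROLL"]))
```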
Read more
- About: Brian Duke has been a data scientist for 23 years in the Financial Services industry. He has worked at Capital One, FICO, SAS Institute, Accenture, Experian, Petal Card and currently is the Head of Data Science at Prism Data. A common theme in his work has been translating transactional data into useful scores and analytical insights for use in risk decisioning. Brian received his BA and MS from the University of California, San Diego and continues to reside in the San Diego area today. He holds 4 patents and has 12 pending in the United States. Kyle Nero graduated from UCSD HDSI in 2023, majoring in data science and minoring in business. During his senior year, he engaged with industry partner Prism Data through the HDSI Senior Capstone Project. He went on to intern with Prism Data following his graduation and joined the team full time as a Data Scientist in September 2023.
- Mentoring Style: Work will be organized as group projects completed in groups of 3-4. The goal of the course is to eventually build a credit score, but we will start by building a transaction categorization model using NLP techniques. Each week we will talk about techniques that can be applied to the next step in the project. We will begin by reviewing homework from the previous week and discussing ideas, then introduce the next step and talk about what can be done to solve it. The goal is to introduce students to the model development process used in most financial services companies.
- Suggested Prerequisites:
- Summer Tasks:
  - https://www.capitalone.com/learn-grow/money-management/when-did-credit-scores-start/
  - https://www.capitalone.com/learn-grow/money-management/fair-credit-reporting-act/
  - https://www.capitalone.com/learn-grow/money-management/equal-credit-opportunity-act/
  - https://www.nerdwallet.com/article/finance/credit-score-ranges-and-how-to-improve
  - https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
- Previous Project
Interplay between Machine Unlearning and Optimization
Jun-Kun Wang • jkw005@ucsd.edu
TA: In-person
A02 4 seats Fridays at 1PM.
“Machine Unlearning” concerns the scenario in which a model trained by an algorithm on a dataset is updated to respond to a request to "delete" certain sample points used in its training. The motivation arises due to the recent success of Large Language Models (LLMs) and other foundation models that are potentially trained on large corpora of data, some of which might contain copyrighted data points. Developing methods to tackle this problem has become an urgent need, partially due to recent regulations such as the EU's *Right to be Forgotten*. In this project, we will first review related works, e.g., [1]-[4]. We will then consider developing our own methods. Students working on this project will be required to implement and conduct comprehensive experiments to test the proposed methods using Python/PyTorch. References:
- [1] TOFU: A Task of Fictitious Unlearning for LLMs. Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C. Lipton, J. Zico Kolter. Conference on Language Modeling (COLM) 2024.
- [2] Are We Making Progress in Unlearning? Findings from the First NeurIPS Unlearning Competition. Triantafillou et al. arXiv:2406.09073.
- [3] Exact Unlearning of Finetuning Data via Model Merging at Scale. Kevin Kuo, Amrith Setlur, Kartik Srinivas, Aditi Raghunathan, Virginia Smith. MCDC @ ICLR 2025.
- [4] Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning. Ruiqi Zhang, Licong Lin, Yu Bai, Song Mei. Conference on Language Modeling (COLM) 2024.
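As a concrete (and deliberately naive) illustration of the problem setup, the sketch below trains a toy PyTorch classifier and then applies a simple gradient-ascent-on-the-forget-set baseline. It is not one of the methods from [1]-[4], just a starting point for thinking about the forget/retain trade-off.

```python
# A minimal sketch of a naive unlearning baseline (gradient ascent on the "forget" set),
# using a toy classifier and synthetic data.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 20)
y = (X[:, 0] > 0).long()
forget_idx = torch.arange(0, 64)          # pretend these points must be "deleted"
retain_idx = torch.arange(64, 512)

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
loss_fn = nn.CrossEntropyLoss()

# 1) Ordinary training on all data.
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()

# 2) Unlearning: ascend on the forget set, descend on the retain set to preserve utility.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(50):
    opt.zero_grad()
    forget_loss = loss_fn(model(X[forget_idx]), y[forget_idx])
    retain_loss = loss_fn(model(X[retain_idx]), y[retain_idx])
    (-forget_loss + retain_loss).backward()   # maximize forget loss, minimize retain loss
    opt.step()

with torch.no_grad():
    print("forget acc:", (model(X[forget_idx]).argmax(1) == y[forget_idx]).float().mean().item())
    print("retain acc:", (model(X[retain_idx]).argmax(1) == y[retain_idx]).float().mean().item())
```

A key question this baseline ignores, and that the readings address, is how to certify that the forgotten points no longer influence the model rather than just degrading accuracy on them.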
Read more
- About: I am an assistant professor at HDSI and ECE. My research is centered around optimization and its connections with statistics and machine learning.
- Mentoring Style: Students will be expected to use Python/PyTorch to implement their algorithms (and should be able to code).
- Suggested Prerequisites:
- Summer Tasks: Try to understand the materials on this website https://unlearning-challenge.github.io/. We will test our algorithms by following the experimental setup and the datasets detailed on this site.
- Previous Project
Open LLM training, inference, and infrastructure
Hao Zhang • haz094@ucsd.edu
TA: Zoom
A03 8 seats TBD
The rapid advancement of large multimodal models has revolutionized AI systems, resulting in unprecedented levels of intelligence as seen in OpenAI’s GPT-4. However, despite its performance, the training and architecture details of GPT-4 remain unclear, hindering research and open-source innovation in this field. In this project, we'll explore three areas relevant to LLMs:
- On the system side: infrastructure for scalable training and high-throughput serving with advanced memory management and parallelization techniques.
- On the model side: build a multimodal model close to ChatGPT quality, which can also interact with the real world by taking actions and using tools.
- On the data and benchmark side: develop highly curated datasets and a benchmark platform with novel data augmentation, data filtering, and ranking methods.
Read more
- About: Hao Zhang is an Assistant Professor in Halıcıoğlu Data Science Institute and the Department of Computer Science and Engineering at UC San Diego. Before joining UCSD, Hao was a postdoctoral researcher at UC Berkeley working with Ion Stoica (2021 - 2023). Hao completed his Ph.D. in Computer Science at Carnegie Mellon University with Eric Xing (2014 - 2020). During his PhD, Hao took leave and worked for the ML platform startup Petuum Inc (2016 - 2021). Hao's research interests lie at the intersection of machine learning and systems. Hao's past work includes vLLM, Chatbot Arena, Vicuna, Alpa, Poseidon, Petuum. Hao's research has been recognized with the Jay Lepreau best paper award at OSDI'21 and an NVIDIA pioneer research award at NeurIPS'17. Hao also cofounded the company LMNet.ai (2023), which joined Snowflake in November 2023, and the nonprofit LMSYS Org (2023), which maintains many popular open models, evaluations, and systems.
- Mentoring Style: Hands-off.
- Suggested Prerequisites:
- Summer Tasks: Read papers in https://cseweb.ucsd.edu/~haozhang/publication
- Previous Project
Evaluation Strategies for Next-Generation AI Systems
Rajeev Chhajer, Ryan Lingo • rajeev_chhajer@honda-ri.com, ryan_lingo@honda-ri.com
TA: Zoom
A04 12 seats Mondays at 12pm-1pm PST
As AI systems powered by Large Language Models (LLMs) become increasingly pivotal in high-impact scenarios, conventional metrics such as accuracy on public datasets are no longer sufficient. This domain revolves around exploring more flexible, context-aware methodologies for evaluating system performance across diverse dimensions, including depth of reasoning, ethical alignment, user experience, and practical effectiveness. Students will investigate and prototype varied assessment methods in multiple use cases and data environments, potentially incorporating automated audits to ensure continuous model alignment with organizational norms. By designing scalable evaluation pipelines and adapting them to specific application areas, participants will work toward developing AI systems that are more transparent, responsive, and dependable in evolving real-world contexts.
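To make the idea of a context-aware evaluation pipeline concrete, here is a minimal sketch of an evaluation harness in Python. `model_answer` is a placeholder you would replace with a real LLM call, and the rubrics are illustrative only.

```python
# Sketch of a tiny evaluation harness: each case carries its own rubric, and the "model"
# here is a stand-in function to be replaced with a real LLM call.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    rubric: Callable[[str], float]   # maps a model response to a score in [0, 1]

def must_mention(*terms):
    return lambda response: sum(t.lower() in response.lower() for t in terms) / len(terms)

def model_answer(prompt: str) -> str:
    return "Placeholder response mentioning safety and uncertainty."  # replace with an LLM call

cases = [
    EvalCase("Explain the risks of deploying this model in a hospital.", must_mention("safety", "uncertainty")),
    EvalCase("Summarize the user's request in one sentence.", lambda r: 1.0 if len(r.split()) <= 30 else 0.0),
]

scores = [case.rubric(model_answer(case.prompt)) for case in cases]
print({"per_case": scores, "mean": sum(scores) / len(scores)})
```

Real pipelines in this domain would replace the keyword rubrics with richer judgments (human, model-based, or task-specific), which is exactly the design space students will explore.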
Read more
- About: Rajeev Chhajer is the Chief Engineer at Honda Research Institute USA and leads the Software-defined Intelligence team at 99P Labs. He is a founding member of 99P Labs, a research initiative dedicated to developing sustainable technologies and innovative approaches to global challenges. His research focuses on smart city ecosystems, embedded systems, and connectivity to support sustainable and efficient mobility. Ryan Lingo is an Applied AI Research Engineer and Developer Advocate at 99P Labs. His work focuses on intelligent systems, with research interests in large language models, synthetic data, and applied machine learning. He has an academic background in philosophy and has held roles in data science and software engineering, with experience spanning academic research, industry, and early-stage startups.
- Mentoring Style: We plan to take an engaged but student-led approach to mentoring. We'll work closely with the students throughout the project—meeting regularly, providing guidance, and being available for feedback and support. That said, we're looking for high-agency students who are excited to take ownership of their learning and direction. The best mental model for this capstone is "learning in public." Students will play an active role in shaping the plan and setting objectives. Rather than being given step-by-step instructions, they'll be encouraged to explore, make decisions, and figure out how to execute their ideas, with our mentorship to guide the way. We'll help them think critically, problem-solve, and communicate their process and outcomes clearly. While we won’t dictate tasks at a granular level, we’ll be present every week and ensure they have the support and structure they need to succeed.
- Suggested Prerequisites:
- Summer Tasks: Before the quarter starts, if you are interested in getting a head start, these three works offer advanced perspectives on evaluating large language models:
  - [Your AI Product Needs Evals](https://hamel.dev/blog/posts/evals/) by Hamel Husain. This work dives into methods for assessing LLMs with a more holistic lens, addressing performance from qualitative angles that go beyond standard metrics. It challenges us to think about aspects such as contextual coherence and adaptability.
  - [A Field Guide to Rapidly Improving AI Products](https://hamel.dev/blog/posts/field-guide/) by Hamel Husain. This guide provides practical strategies and frameworks for developing tailored evaluation pipelines. It emphasizes systematic, reproducible methods that capture nuanced performance characteristics and are well-suited for real-world scenarios.
  - [Task-Specific LLM Evals that Do & Don't Work](https://eugeneyan.com/writing/evals/) by Eugene Yan. This piece critically examines traditional evaluation metrics and offers alternative, context-sensitive approaches to measuring LLM performance. It highlights the importance of incorporating reasoning depth, fairness, and other ethical considerations into our testing frameworks.

  Together, these resources reinforce a shift toward evaluation methods that are context-aware, scalable, and aligned with the intricacies of modern language models. Focusing on these ideas will help you design and prototype robust evaluation pipelines that can address the evolving challenges in AI. If you'd like to discuss the project further, please connect with Ryan Lingo on LinkedIn; he would be happy to chat.
- Previous Project
Mining Privacy Designs in the News
Haojian Jin • h7jin@ucsd.edu
TA: In-person
A05 10 seats Tuesday afternoon
Mining Privacy Designs in the News aims to uncover the patterns, strategies, and tensions in how digital privacy is represented across news media. By analyzing a large corpus of news articles, this work surfaces recurring narratives, metaphors, and framings that shape public understanding of privacy. Through this lens, we can trace how societal norms, stakeholder agendas, and cultural anxieties are reflected in—and influenced by—media discourse, offering a foundation for more responsive design, policy, and public engagement.
Read more
- About: Human developers create risky computer systems that eventually affect human users. Our lab, Data Smith Lab, takes a human-centered approach to (1) help developers create systems with enhanced privacy and security features and (2) help users safeguard their privacy and security.
- Mentoring Style: We will have weekly meetings. My PhD students will also help the team. The ultimate goal of the research project is to produce a research paper.
- Suggested Prerequisites:
- Summer Tasks: https://www.haojianj.in/resource/pdf/sketchprivacy.pdf
- Previous Project
Digital twin model for health with wearable data
Tauhidur Rahman • trahman@ucsd.edu
TA: Hybrid (Zoom and in-person)
A06 8 seats Monday AM. We will figure a specific timeslot out when the quarter gets near and we know all of our schedules better.
The biotech and biomedical industries rely heavily on randomized controlled trials (RCTs) to evaluate the efficacy and safety of medical interventions, yet these trials are prohibitively expensive and time-consuming—costing an average of $1.3 billion per drug and spanning 10–15 years for market approval in the pharmaceutical sector. Similarly, preclinical animal testing adds significant financial and ethical burdens, with failure rates exceeding 85% when translating results to humans. In this proposal, we seek to address these inefficiencies by advancing digital twin technologies as a transformative paradigm in biomedical research and development.

Digital twins—AI-driven computational replicas of physiological systems—have the potential to accelerate innovation by providing scalable, cost-effective, and ethically viable alternatives to conventional RCTs. They can simulate disease progression, predict patient responses to interventions, and enable in silico testing of therapeutics, reducing reliance on human and animal trials. However, major gaps hinder their widespread adoption, including (1) lack of standardized, generalizable representations of physiological systems across individuals and timescales, (2) computational inefficiencies in personalizing models for diverse populations, and (3) limited capacity for knowledge transfer between different interventions and treatment domains.

This project aims to bridge these gaps by establishing foundational methods for scalable, personalized, and computationally efficient digital twins. Through advanced physiological modeling, knowledge graph integration, and large language model (LLM)-augmented frameworks, we will develop novel approaches to enhance prediction, personalization, and decision-support in biomedical innovation. Our proposed methods will not only improve the efficiency and reliability of digital twins but also provide a generalizable infrastructure applicable across diverse medical domains, including diabetes management, kidney dialysis, and substance use disorder interventions.

To address these challenges, we propose three core research thrusts:
- (i) Representation of Physiological Systems for Closed-Loop Modeling: We will develop featurization and representation learning techniques that generalize across diverse biomedical data types (from physiological signals to symptoms and comorbidities), ensuring adaptability across time scales (minutes to years) and population levels (individuals to cohorts).
- (ii) Computationally Efficient Digital Twins with Personalization Mechanisms and Knowledge Graphs: We will design scalable, personalized digital twin models that dynamically identify similar individuals within a population to enhance forecasting accuracy for physiological states based on specific interventions. Additionally, we will integrate knowledge graphs to learn relationships between different interventions, allowing the model to extrapolate to novel treatment strategies.
- (iii) Demonstration of a Generalizable Large Language Model (LLM)-Augmented Digital Twin Framework and Dashboard: We will implement and validate a scalable, interactive framework to support digital twin experimentation in multiple biomedical domains, including diabetes management, kidney dialysis, and substance use disorder interventions.
Read more
- About: I direct the Mosaic lab.
- Mentoring Style: I, along with my PhD student, will attend the meeting every week. My mentoring style is to help you find the right tools, identify intermediate goals, and eventually ask challenging questions. Ultimately, success will largely depend on you and your approach to tackling complex questions with lots of uncertainty. I want you to take leadership (consider this an opportunity to write your own paper or build your own startup, just as an example), and I will be there to help you navigate the complex landscape of uncertainty in research.
- Suggested Prerequisites:
- Summer Tasks: Enhance your familiarity with LLMs, deep learning, and generative AI tools and techniques.
- Previous Project
Developing lessons to ease teachers into time series data science
Benjamin Smarr • bsmarr@ucsd.edu
TA: In-person (will book own room)
A07 8 seats After 1, before 5
Data science skills can be transformative, especially when deployed in communities without historical access to digital resources. UCSD supports tools that make access to data science free, but without lessons aimed at helping teachers make confident use of these tools, most teachers won't see the potential of data science training, and won't want to offer curricula to their students. Our goal will be to develop lightweight lessons that take someone from zero knowledge to the ability to load small data sets into JupyterLite webpages and carry out basic analyses and visualization, so that they can feel good showing others in their community how these skills could help their students.
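As a rough illustration of the level the lessons aim for, here is the kind of short, self-contained analysis a starter lesson might build toward. The dataset is invented; a real lesson would ship a small CSV and load it with pandas.

```python
# The kind of first analysis a lesson might build toward: load a small table, summarize it, plot it.
# The same code runs in a regular Jupyter notebook or in a JupyterLite page once pandas is available.
import pandas as pd
import matplotlib.pyplot as plt

# A tiny made-up classroom dataset; a real lesson would use pd.read_csv("data.csv") on a provided file.
df = pd.DataFrame({
    "hour": [8, 10, 12, 14, 16, 18],
    "temperature_f": [61, 66, 72, 75, 71, 65],
})

print(df.describe())                 # quick numeric summary
df.plot(x="hour", y="temperature_f", marker="o", title="Schoolyard temperature over a day")
plt.show()
```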
Read more
- About: Prof. Smarr comes from a neuroscience and biological rhythms background. His lab focuses on using longitudinal data sources to develop novel analytics that reveal biologically relevant information from these data, framed by an understanding of the way biological data tend to change at different timescales. This is sometimes naturalistic, but more often related to biomedical algorithm development.
- Mentoring Style: You will work mostly within your group, with some guidance from graduate students who have done related work. You will also meet weekly with Prof. Smarr to assess progress and shape next steps.
- Suggested Prerequisites:
- Summer Tasks: Read https://pmc.ncbi.nlm.nih.gov/articles/PMC5301430/ and https://pubmed.ncbi.nlm.nih.gov/35870975/, and develop some familiarity with recreating the figures and analyses.
- Previous Project
Communication Complexity
Shachar Lovett • slovett@ucsd.edu
TA: In-person (will book own room)
A08 4 seats Mondays 1-2pm
Communication complexity is a field of theoretical computer science that studies the amount of communication required between two or more parties to jointly compute a function whose input is distributed among them. In its most basic model, two players—Alice and Bob—receive inputs x and y respectively, and want to jointly compute some function f(x,y) using as little communication as possible. The focus is on minimizing the number of bits exchanged, under various models (deterministic, randomized, or quantum), while still correctly computing the function. This model has been studied extensively since its introduction in the 80s, with many notable applications, but still there are many fundamental problems that are wide open. These include the log-rank conjecture, which attempts to classify the structure of efficient deterministic protocols using linear algebra; understanding the combinatorial structure of functions with efficient randomized protocols; analogs of these to multi-party settings; and many others. Moreover, problems in communication complexity tend to have strong connections to fundamental problems in combinatorics and other areas of math, which sometimes gives a unique perspective to help understand these problems better.
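To make the model concrete, here is a small simulation of the classic randomized fingerprinting protocol for the EQUALITY function, which uses exponentially less communication than any deterministic protocol. The code is for intuition only and is not part of the course materials.

```python
# Illustration: the randomized fingerprinting protocol for EQUALITY on n-bit inputs.
# Alice picks a random prime p <= n^2 and sends (p, x mod p); Bob accepts iff y mod p matches.
# This costs O(log n) bits, versus roughly n bits for any deterministic protocol.
import random

def primes_up_to(m):
    sieve = [True] * (m + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(m ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = [False] * len(sieve[i * i :: i])
    return [i for i, is_prime in enumerate(sieve) if is_prime]

def equality_protocol(x: int, y: int, n: int) -> bool:
    p = random.choice(primes_up_to(n * n))   # Alice's random prime; sending (p, x mod p) takes ~4*log2(n) bits
    return x % p == y % p                    # Bob's decision

n = 64
x = random.getrandbits(n)
y = x + 2 * 3 * 5 * 7 * 11 * 13              # unequal inputs whose difference has several small prime factors
wrong = sum(equality_protocol(x, y, n) for _ in range(2000))
# If x != y, the protocol errs only when p divides x - y; since |x - y| < 2^n has fewer than n prime
# factors and there are about n^2 / ln(n^2) primes to choose from, the error probability vanishes as n grows.
print(f"false 'equal' rate on unequal inputs: {wrong / 2000:.4f}")
```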
Read more
- About: I work broadly in theoretical computer science, with a specific interest in computational complexity, randomness and pseudo-randomness, algebraic constructions, optimization, and combinatorics.
- Mentoring Style: Initially the capstone project will run standalone, as students learn the area. Then, depending on their interest, students can potentially join an existing research project or work on their own research project.
- Suggested Prerequisites:
- Summer Tasks: The book "Communication Complexity and Applications" by Rao and Yehudayoff is a good starting point. See also the following surveys for more recent progress related to my work:
  - Recent advances on the log rank conjecture: http://bulletin.eatcs.org/index.php/beatcs/article/download/260/245
  - Models of computation between decision trees and communication: https://escholarship.org/content/qt80d93481/qt80d93481.pdf
- Previous Project
Understanding deep learning through feature learning
Tianhao Wang • tianhaowang@ucsd.edu
TA: In-person
A09 8 seats Friday
The success of deep learning is often attributed to the ability of neural networks to learn useful features from data. Yet, the process of feature learning remains mysterious. In this project, we aim to develop fundamental understanding of how neural networks learn features, with an emphasis on the dynamical perspective of the training process. We will explore feature learning for various neural network architectures, and investigate surprising phenomena in deep learning such as implicit regularization and grokking, etc. Participants will gain hands-on experience in training neural networks and analyzing the training process.
Read more
- About: Tianhao Wang will be joining the Halıcıoğlu Data Science Institute at UC San Diego as a tenure-track assistant professor in July 2025. He received his PhD from the Department of Statistics and Data Science at Yale University in 2024. Prior to Yale, he received his BS from University of Science and Technology of China in 2018. His research interests are mainly in theoretical foundations of statistical learning, especially high-dimensional statistics and deep learning.
- Mentoring Style: Participants should feel free to explore on their own if they find a specific topic interesting. Meanwhile, participants will have opportunities to work with my PhD students. I will provide hands-on guidance.
- Suggested Prerequisites:
- Summer Tasks: It would be helpful to get familiar with deep learning frameworks such as PyTorch or JAX.
- Previous Project
Deep Learning for Climate Model Emulation
Duncan Watson-Parris • dwatsonparris@ucsd.edu
TA: In-person
A10 6 seats Wednesday 2-3pm
The choices humanity makes in the next few decades will determine how much warmer the Earth will be by the end of the century, with implications for billions of lives and trillions of dollars in GDP. Many different emission pathways exist that are compatible with the Paris climate agreement, and many more are possible that miss that target. While some of the most complex climate models have simulated a small selection of these, it is impractical to use these computationally expensive models to fully explore the space of possibilities or assess all the associated risks. Our lab has recently developed state-of-the-art climate model emulators to enable fast, accurate and reliable predictions for any given scenario (https://github.com/duncanwp/ClimateBench). This project will extend this work by incorporating multiple climate models at different levels of fidelity to provide high-resolution predictions with robust uncertainties for improved decision making.
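As a taste of the data handling involved (see also the xarray summer task below), here is a small sketch that builds a toy temperature field in xarray and computes an area-weighted global mean. The data is synthetic and only illustrates the kinds of reductions used when working with climate model output such as ClimateBench's.

```python
# Toy xarray example: a synthetic near-surface temperature field and two typical reductions.
import numpy as np
import pandas as pd
import xarray as xr

lat = np.arange(-90, 91, 30.0)
lon = np.arange(0, 360, 60.0)
time = pd.date_range("2000-01-01", periods=24, freq="MS")

# Synthetic temperature (kelvin) with a latitudinal gradient, a seasonal cycle, and noise.
data = (288
        + 10 * np.cos(np.deg2rad(lat))[None, :, None]
        + 5 * np.sin(2 * np.pi * np.arange(24) / 12)[:, None, None]
        + np.random.randn(24, lat.size, lon.size))

tas = xr.DataArray(data, coords={"time": time, "lat": lat, "lon": lon},
                   dims=("time", "lat", "lon"), name="tas")

# Area-weighted global mean, then annual means -- typical first steps with climate model output.
weights = np.cos(np.deg2rad(tas.lat))
global_mean = tas.weighted(weights).mean(dim=("lat", "lon"))
print(global_mean.groupby("time.year").mean())
```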
Read more
- About: Duncan Watson-Parris is an atmospheric physicist working at the interface of climate research and machine learning. The Climate Analytics Lab (CAL) he leads focuses on understanding the interactions between aerosols and clouds, and their representation within global climate models. CAL is leading the development of a variety of machine learning tools and techniques to optimally combine a variety of observational datasets, including global satellite and aircraft measurements, to constrain and improve these models. Duncan is also keen to foster the application of machine learning to climate science questions more broadly and convenes the Machine Learning for Climate Science EGU session and co-convenes the “AI and Climate Science” discovery series that is part of the United Nations’ AI for Good program.
- Mentoring Style: This work is central to CAL so I will meet personally with the students, and it can be integrated into the broader research program to the extent the students want to engage with it. The students will be welcome to join our Lab meetings (typically held at Scripps Institution of Oceanography).
- Suggested Prerequisites:
- Summer Tasks:
  - Skim the latest UN Intergovernmental Panel on Climate Change Synthesis Report to get a summary of the latest climate change science, especially the figures: https://www.ipcc.ch/report/ar6/syr/downloads/report/IPCC_AR6_SYR_SPM.pdf
  - Read the ClimateBench paper: https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2021MS002954
  - Try out the xarray Python library for working with climate data: https://docs.xarray.dev/en/stable/
- Previous Project
Data Valuation & Curation for Trustworthy AI
Babak Salimi •
TA: Zoom
A11 8 seats Friday 3:00 – 4:00 PM PT
Modern machine-learning pipelines increasingly live or die on data decisions—what to keep, what to toss, what to label next, and how much each record is “worth.” This domain surveys the growing toolkit of data-valuation and data-centric AI methods—Data Shapley (ICML 2019), influence functions, MMD/Wasserstein-based coresets, active-learning and subset-selection heuristics—that assign a quantitative score to every example’s marginal impact on model accuracy, robustness, and privacy risk. We’ll examine how those scores power practical tasks such as pruning noisy or duplicated records, budgeting scarce labeling effort, spotting harmful outliers, and auditing publicly released datasets. Students will dive into open benchmarks like WILDS (https://wilds.stanford.edu/) for distribution shifts and the DataPerf challenges (https://dataperf.org/) for curation leaderboards, replicate baseline valuation techniques in Quarter 1, and use the insights—and gaps—they uncover to propose original data-selection or curation projects for Quarter 2.
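To make the idea of scoring each example's marginal impact concrete, here is a minimal sketch of permutation-sampling (Monte Carlo) Data Shapley on a synthetic classification task. Real valuation pipelines add truncation and far more permutations; the dataset and model here are placeholders.

```python
# Monte Carlo Data Shapley sketch: average marginal contribution of each training point
# to validation accuracy, estimated over random permutations.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=10, random_state=0)
X_train, y_train, X_val, y_val = X[:60], y[:60], X[60:], y[60:]

def utility(idx):
    """Validation accuracy of a model trained on the subset idx (0 if too few points or one class)."""
    if len(idx) < 2 or len(set(y_train[idx])) < 2:
        return 0.0
    clf = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    return clf.score(X_val, y_val)

rng = np.random.default_rng(0)
n = len(X_train)
shapley = np.zeros(n)
n_perms = 20                      # more permutations -> lower-variance estimates

for _ in range(n_perms):
    perm = rng.permutation(n)
    prev_u = utility([])
    for k, i in enumerate(perm, start=1):
        new_u = utility(perm[:k])
        shapley[i] += new_u - prev_u
        prev_u = new_u

shapley /= n_perms
print("most valuable points:", np.argsort(shapley)[-5:])
print("least valuable points:", np.argsort(shapley)[:5])
```

The Quarter 1 replications in this domain study how scores like these behave under noise, duplication, and distribution shift, which is where the open benchmarks above come in.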
Read more
- About: Babak Salimi is an Assistant Professor in the Halıcıoğlu Data Science Institute at UC San Diego. His research lies at the intersection of data management and machine learning, with a focus on building reliable, safe, and robust data-driven systems. He develops methods and tools that enhance the transparency and dependability of algorithmic decision-making, empowering practitioners to make informed and confident choices.
- Mentoring Style: Capstone students will operate as an independent cohort: we will meet for a dedicated one-hour session each week where I lead mini-lectures, paper discussions, and milestone check-ins. While my PhD students are not formally part of the section, I may occasionally invite them to hold optional office-hour–style drop-ins for coding or tooling questions. I’ll be hands-on during project framing, data wrangling, and experimental design, then step back so teams can drive their own experiments and insights, checking progress and providing strategic guidance every week.
- Suggested Prerequisites:
- Summer Tasks:
  - Readings — to ground everyone in core data-valuation methods:
    - "Data Shapley: Equitable Valuation of Data for Machine Learning" (ICML 2019)
    - "LAVA: Data Valuation without Pre-Specified Learning Algorithms" (ICLR 2023)
    - "Adversarial Attacks on Data Valuation" (NeurIPS 2024)
  - Hands-on — pick one: Clone the LAVA GitHub repo and run the quick-start script on a small UCI dataset, producing a ranked list of data values.
- Previous Project
Large Language (Multi-Modal) Model Reasoners and Agents
Zhiting Hu • zhh019@ucsd.edu
TA: Zoom / In-person hybrid
A12 10 seats Tuesday 3-4PM
A central topic in Large Language or Multi-Modal Model research is enhancing their ability to perform complex reasoning on diverse problems. A rich body of research has been devoted to generating multi-step reasoning chains with LLMs, such as Chain-of-Thought (CoT), Reasoning-via-Planning (RAP), the OpenAI o-series, etc. This capstone aims to explore the diverse reasoning approaches of LLMs (and/or large multi-modal models) and investigate improvements, applications, and scalable implementations of these approaches. For example: (1) Proposing new reasoning algorithms or improvements over existing reasoning algorithms in terms of performance; (2) Developing algorithmic and/or system innovations to scale up existing advanced reasoning algorithms; (3) Developing new agent frameworks for various applications like deep research, AI scientists, real-world embodied and social agents, etc.
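As a tiny illustration of one reasoning-time technique in this family (self-consistency over sampled chains of thought), here is a sketch in which `sample_chain_of_thought` is a stand-in for a real LLM sampling call.

```python
# Self-consistency sketch: sample several reasoning chains, extract each final answer,
# and take a majority vote. The sampler below is a placeholder, not a real LLM.
import random
from collections import Counter

def sample_chain_of_thought(question: str) -> str:
    # Stand-in: a real implementation would prompt an LLM with something like "Let's think step by step."
    return random.choice([
        "15 apples minus 6 leaves 9. Answer: 9",
        "15 - 6 = 9. Answer: 9",
        "6 + 15 = 21. Answer: 21",   # an occasional faulty chain
    ])

def extract_answer(chain: str) -> str:
    return chain.rsplit("Answer:", 1)[-1].strip()

def self_consistency(question: str, k: int = 9) -> str:
    answers = [extract_answer(sample_chain_of_thought(question)) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("If I have 15 apples and give away 6, how many remain?"))
```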
Read more
- About: Zhiting Hu is an Assistant Professor in Halicioglu Data Science Institute at UC San Diego. He received his Bachelor's degree in Computer Science from Peking University in 2014, and his Ph.D. in Machine Learning from Carnegie Mellon University in 2020. His research interests lie in the broad area of machine learning and artificial intelligence, with a focus on principles, methodologies, and systems of building AI agents that learn and reason with efficiency and generality. His current work centers on building general world models for next-generation machine reasoning and unified learning mechanisms for training machines with all types of experience. His research was recognized with outstanding paper awards at ACL 2016 and ACL 2024, and best demo nominations at ACL 2019 and NAACL 2024.
- Mentoring Style: Students can either join the mentor's research group to work closely with PhD students/postdocs on relevant projects, or propose their own ideas and lead the projects. Students are expected to be independent, and the mentor will provide advice as needed (PhD students/postdocs can also provide more hands-on guidance).
- Suggested Prerequisites:
- Summer Tasks:
  - [1] https://arxiv.org/abs/2404.05221
  - [2] https://arxiv.org/abs/2312.05230
  - [3] https://arxiv.org/abs/2305.14992
- Previous Project
Analysis of Temporally Varying Point Cloud using Optimal Transport
Alex Cloninger, Rayan Saab • acloninger@ucsd.edu, rsaab@ucsd.edu
TA: In-person
A14 8 seats Early afternoons Mondays is ideal, early afternoons TuTh may work too depending on teaching.
Time-varying point clouds appear in a number of important applications. These range from Motion Capture (MOCAP) data, to molecular and particle dynamics, to crowd and swarm dynamics. In these applications, each "datum" of interest is a multi-dimensional time series of a large number of points over a large number of time steps, and the associated questions are how to cluster and classify these data, or how to generate new examples. Unfortunately, analysis of these problems can be quite complex. Fundamentally, this boils down to three issues: 1) the lack of point-to-point registration from one time step to a later one, 2) the lack of time series tools for dealing with high dimensional time series, especially when the data is not in a simple Euclidean vector space, and 3) the sheer size of the data storage / computation for just one time series example. The domain of this project will cover methods in comparing these point clouds as coming from distributions that are time varying, and thinking about analysis of these distributions. One tool we will use for these analyses is optimal transport, which can benefit the problem both theoretically and computationally. We will also consider deep learning and signal processing approaches to these types of data. Students who choose this project will delve into the mathematical and computational problems of these data types, utilizing tools from probability and statistics, signal processing, and linear algebra. They will also engage in hands-on coding and experimentation on algorithms for optimal transport and time series models, testing them on various data sets.
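For a first feel of the tools, here is a small sketch that compares two synthetic point-cloud "frames" with the POT toolbox mentioned in the summer tasks. Note that optimal transport needs no point-to-point registration between the frames.

```python
# Compare two synthetic point-cloud "frames" with optimal transport, using POT (pip install pot).
import numpy as np
import ot

rng = np.random.default_rng(0)
frame_t0 = rng.normal(size=(200, 3))                                                 # point cloud at time t
frame_t1 = frame_t0 + np.array([0.5, 0.0, 0.0]) + 0.05 * rng.normal(size=(200, 3))   # a shifted later frame

# Treat each cloud as a uniform empirical distribution; no correspondence between points is assumed.
a = np.full(len(frame_t0), 1.0 / len(frame_t0))
b = np.full(len(frame_t1), 1.0 / len(frame_t1))
M = ot.dist(frame_t0, frame_t1)          # pairwise squared Euclidean costs (POT's default metric)
print("exact OT cost:", ot.emd2(a, b, M))
print("entropic (Sinkhorn) cost:", ot.sinkhorn2(a, b, M, reg=0.1))  # scales better for large clouds
```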
Read more
- About: Alex Cloninger is an Associate Professor in Mathematics and the Halicioglu Data Science Institute. He works on computational models for learning similarities between data, and using these similarity measures to solve various scientific problems. Rayan Saab is a Professor in the Mathematics Department and at the Halicioglu Data Science Institute. He works on developing computational methods and theory for solving problems related to collecting, processing, and analyzing data. He came to this work first through an undergrad degree in electrical engineering and finding himself always interested in both making things work and understanding why they do.
- Mentoring Style: We both are relatively hands-on in the sense that we make ourselves available for problem-solving and discussions. That said, students have to be self-motivated, and motivated to do the readings and the work.
- Suggested Prerequisites:
- Summer Tasks: Here are some relevant readings / videos. Students need not go into the mathematical details as we can go through them together, but these papers give an idea of the different approaches and applications. The more familiar you are with the topic, the more we can do! To be able to obtain really nice experimental results, you'll need to pick up PyTorch and also the POT: Python Optimal Transport toolbox.
  - **Optimal Transport**
    - https://www.youtube.com/watch?v=EauDdCzxphE
    - https://www.youtube.com/watch?v=mITml5ZpqM8
  - **Example Applications with Temporal Point Clouds**
    - https://mocap.cs.sfu.ca/
    - https://www.ipb.uni-bonn.de/data/4d-plant-registration/
- Previous Project
ALERTCalifornia - Extreme Events Detection
Nathan Hui (Qualcomm Institute/JSOE), Falko Kuester (Qualcomm Institute/JSOE), Neal Driscoll (SIO) • nthui@ucsd.edu, fkuester@ucsd.edu, ndriscoll@ucsd.edu
TA: TBD
A15 4 seats Mo/Tu/We/Th 10am-3pm
The ALERTCalifornia research program continues UC San Diego’s more than 20-year legacy of collecting high-quality data through a network of natural hazard monitoring and detection cameras across the state. This growing network includes over 1,150 camera sensors that provide real-time imagery. They are located in wild spaces, on towers, and other high points across the entire state of California, and are used to watch for and monitor extreme events including wildfire and weather. The program’s historical archive of camera data contains over 38 billion timestamped and localized frames. These camera data have facilitated CALFIRE’s ability to rapidly respond to emerging wildfires as well as maintain situational awareness during ongoing wildfires and other natural disasters. We would like to investigate where machine learning techniques can assist with assessing camera network health, data integrity, and environmental signals. Potential projects include camera site uptime detection, cloud detection, marine layer height detection, Visual Flight Rules altitude estimation, horizon detection, or camera positioning calibration.
Read more
- About: Nathan Hui is currently a research engineer at UC San Diego at the Qualcomm Institute. His area of focus is multi-domain robotics, 3D imaging, and distributed sensor networks. Previous projects include tracking transmittered wildlife using drones, measuring physical oceanographic data using intelligent surfboard fins, and measuring fish length using low-cost lasers, dive cameras, and machine learning. Prof. Kuester received an MS degree in Mechanical Engineering in 1994 and MS degree in Computer Science and Engineering in 1995 from the University of Michigan, Ann Arbor. In 2001 he received a Ph.D. from the University of California, Davis and currently is the Calit2 Professor for Visualization and Virtual Reality at the University of California, San Diego. Professor Kuester holds appointments as Professor in the Departments of Structural Engineering and Computer Science and Engineering at the Jacobs School of Engineering (JSoE) and serves as the director of the Cultural Heritage Engineering Initiative (CHEI), the Center of Interdisciplinary Science for Art, Architecture and Archaeology (CISA3), the Calit2 Center of Graphics, Visualization and Virtual Reality (GRAVITY) and the DroneLab. Neal Driscoll is a professor of geology and geophysics in the Geosciences Research Division at Scripps Institution of Oceanography at UC San Diego. Driscoll researches tectonic deformation and the evolution of landscapes and seascapes. His work primarily focuses on the sediment record to understand the processes that shaped the earth. As part of this research, Driscoll spends time at sea acquiring images of the seafloor and subsurface layers to understand the processes that shape Earth. Driscoll is also co-director of UC San Diego’s Center for Public Preparedness (CP2) and the ALERTCalifornia public safety program. ALERTCalifornia provides critical infrastructure for mitigating wildfire and natural disaster risk to life, property and ecosystems. The advanced network of more than 1000 cameras across California helps first responders monitor natural disasters such as wildfires, floods, and landslides. ALERTCalifornia is a vital resource that provides an array of technological tools, infrastructure and research that supports government agencies, utilities and the public in their response to ever-increasing natural disaster risk. ALERTCalifornia also gathers vital data to inform the greater understanding of natural disaster causes, active event behavior and post-event impacts to air quality, water quality, ecosystems, and human health.
- Mentoring Style: We can facilitate mentorship in our facilities (Atkinson Hall). This will occur as part of our research group (regular meetings), with additional oversight under ALERTCalifornia (milestone updates).
- Suggested Prerequisites:
- Summer Tasks: Students should be able to use Nautilus NRP; be familiar with active learning techniques and semi-supervised or unsupervised learning; and be comfortable using web APIs. We expect most work to be done in Python. Please also be familiar with Docker, Poetry, Kubernetes, and Tornado.
- Previous Project
Large Language Models in Healthcare
Aaron Boussina; Karandeep Singh • aboussina@health.ucsd.edu; karandeep@health.ucsd.edu
TA: Zoom
A16 6 seats Monday, 10am
The release of GPT-4 in 2023 captured global attention when it demonstrated the ability to pass the United States Medical Licensing Examination (USMLE). Since then, Large Language Models (LLMs) have started to transform many aspects of healthcare, including how patients access medical information, how clinicians document care, and how payers and regulators manage and review clinical workflows. This surge in capability has fueled a wave of startups and initiatives—such as Hippocratic AI—that aim to deploy LLMs for high-stakes, patient-facing applications. However, the complexity, heterogeneity, and high-risk nature of the medical domain present unique challenges that general-purpose AI systems are not inherently equipped to handle. Critical questions remain about when LLMs can be considered safe for clinical use, how to systematically evaluate their performance on bespoke medical tasks, and how to integrate them into healthcare operations in a way that enhances rather than undermines clinical quality and patient safety. In this domain, students will build, adapt, and rigorously evaluate LLM-based systems for healthcare applications. Projects will focus on operationalization challenges such as ensuring factual accuracy, mitigating hallucinations, designing appropriate evaluation frameworks, and aligning model outputs with clinical standards and ethical considerations. Students will gain hands-on experience with both the technical aspects of model development and the practical challenges of deploying AI responsibly in healthcare settings.
Read more
- About: Karandeep is a physician-scientist with expertise in the evaluation and implementation of statistical and machine learning models into the clinical and operational context. His research lab’s focus is on understanding translational issues of bringing AI into clinical practice, including transportability and generalizability issues, dataset shift, and clinical and operational outcomes. He serves as Chief Health AI Officer for the UC San Diego Health System and has a leadership role in the Jacobs Center for Health Innovation. He has >90 peer-reviewed publications focused primarily on machine learning, digital health, and natural language processing. The core focus of Aaron’s research is the development and implementation of predictive and generative systems in healthcare settings. His recent work on reducing sepsis-related mortality using deep learning was featured in Nature Digital Medicine, Fortune, KPBS, and referenced in the Bipartisan House Task Force Report on AI. His recent work in the New England Journal of Medicine AI was the first publication to explore the use of generative AI to automate and scale costly documentation for hospital quality measurement. To enable code-to-clinic contributions, his research combines multiple disciplines including software engineering, deep learning, healthcare informatics, and implementation science.
- Mentoring Style: Capstone students will be integrated into the Jacobs Center for Health Innovation (JCHI) research group and will lead an independent project. Students will be expected to manage their projects, develop their software, test their hypotheses, and submit a peer-reviewed paper by the end of the course (with mentorship).
- Suggested Prerequisites:
- Summer Tasks:
  - Papers:
    - https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2781307
    - https://ai.nejm.org/stoken/default+domain/VIG3W4P4GC3SIQASJF4D/full?redirectUri=doi/full/10.1056/AIcs2400420
  - Packages:
    - Transformers
    - vLLM
- Previous Project
Transformers for graph learning
Yusu Wang and Gal Mishne • yusuwang@ucsd.edu; gmishne@ucsd.edu
TA: TBD
A17 16 seats Wed morning 9am preferred
Graph data is ubiquitous in a broad range of applications. There are two major families of graph learning models: graph neural networks, and transformer-based graph learning models. However, while graph neural networks, such as message-passing graph neural networks, can naturally handle graph-type data (nodes, edges, and features on nodes and edges), standard transformer architectures are designed for sequences or sets, not for relational data like graphs. The graph structure therefore has to be encoded into the transformer somehow, and there are various design choices. In this capstone project, we would like to explore the pros and cons of different design choices for encoding graph information over a collection of graph tasks.
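One widely used design choice is to give each node a positional encoding derived from the graph Laplacian. The sketch below shows the idea with numpy/networkx on a random graph; real pipelines typically wrap this in dataset transforms, and the library choices here are illustrative.

```python
# Laplacian positional encodings: one way to feed graph structure to a transformer.
import networkx as nx
import numpy as np

G = nx.erdos_renyi_graph(n=20, p=0.3, seed=0)
L = nx.normalized_laplacian_matrix(G).toarray()

# Eigenvectors of the normalized Laplacian, skipping the trivial lowest one, give each node a
# k-dimensional "position" a transformer can use (eigenvector signs are arbitrary, which is one
# of the design issues this project can explore).
eigvals, eigvecs = np.linalg.eigh(L)
k = 4
pos_enc = eigvecs[:, 1 : k + 1]                            # shape: (num_nodes, k)

node_features = np.random.randn(G.number_of_nodes(), 16)
tokens = np.concatenate([node_features, pos_enc], axis=1)  # per-node tokens for a transformer
print(tokens.shape)                                        # (20, 20)
```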
Read more
- About: Yusu Wang: geometric deep learning, graph learning, neural algorithmic reasoning, topological data analysis Gal Mishne: Manifold learning, Diffusion geometry, Computational neuroscience, Image processing and graph signal processing, and Applied harmonic analysis.
- Mentoring Style: In the first few weeks we will give some "lectures" on the background, together with reading / experimenting materials. Usually students form groups of around 3 students each to develop the capstone projects.
- Suggested Prerequisites:
- Summer Tasks:
  - https://arxiv.org/abs/2205.12454
  - https://arxiv.org/abs/2302.04181
- Previous Project
Community-Centered Discrimination Audits of LLMs - Bias Rapid Action Teams
Stuart Geiger • sgeiger@ucsd.edu
TA: In-person
A18 6 seats Wednesday before noon
This capstone will work with community members to audit pretrained Large Language Models for discrimination and bias, using perturbation-based or controlled-experimental methods. These systematically vary a template prompt along a potential type of discrimination, then observe differences in outputs. For example, if you ask ChatGPT (or TritonGPT) to act as a college admissions reviewer, does an application's score change if it references the Men's vs Women's basketball team? Or being on the lacrosse versus basketball team? Or being from La Jolla versus San Ysidro? These methods are relatively simple from a statistics perspective, but the hard part is knowing what kinds of discrimination are of most concern to the people who will be impacted by model outputs and creating real-world template prompts that test for those concerns. This capstone will be centered around **talking and listening to real people** about their concerns with LLMs in real-world contexts, then using our data science expertise in a more consulting-style mode. If a team chooses university admissions, they might work with students, high school counselors, professors, and/or admissions staff. All students must take and pass the 3-hour UCSD/CITI IRB Human Subject Protection Training online course (Social and Behavioral Basic) by week 3 of Fall.
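As a minimal sketch of the perturbation idea (not of the community engagement, which is the hard part), the code below varies one attribute in a fixed template and compares score distributions. `score_application` is a placeholder for a real ChatGPT/TritonGPT call plus answer parsing, and the template is invented.

```python
# Perturbation-based audit sketch: hold a prompt template fixed, vary one attribute,
# and compare the resulting score distributions.
import random
import statistics

TEMPLATE = (
    "You are a college admissions reviewer. Rate this applicant from 1-10.\n"
    "Essay: I led my school's {team} team and volunteer weekly in {neighborhood}."
)

def score_application(prompt: str) -> float:
    return random.uniform(5, 9)   # replace with an LLM call plus parsing of the numeric answer

def audit(attribute_values, n_trials=50, **fixed):
    results = {}
    for value in attribute_values:
        prompts = [TEMPLATE.format(**{**fixed, **value}) for _ in range(n_trials)]
        results[tuple(value.values())] = [score_application(p) for p in prompts]
    return results

scores = audit(
    [{"team": "men's basketball"}, {"team": "women's basketball"}],
    neighborhood="La Jolla",
)
for group, vals in scores.items():
    print(group, "mean score:", round(statistics.mean(vals), 2))
```

A real audit would then test whether the group differences are statistically and practically significant, with the perturbed attributes chosen by the community members the team consults.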
Read more
- About: I’m a social scientist with a background in the humanities, especially history and philosophy of science and technology, but I have enough expertise in computer science and data science to make trouble. I believe that data science systems should be fair, transparent, and accountable to the public, but that most are currently not. A lot of my research is in community-centered content moderation NLP systems for user-generated content, especially Wikipedia, where I formerly worked on their ML models and systems.
- Mentoring Style: I will be the point of contact and there every week, but may bring in collaborators and my grad student advisees. I intentionally do not run a "lab", but I do have a "constellation of collaboration." Students can choose their own particular context in which LLMs are deployed and which kinds of community members / impacted people they want to consult.
- Suggested Prerequisites:
- Summer Tasks:
  - A recent example of a perturbation-based audit study: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0318500
  - For more readings, see https://auditlab.stuartgeiger.com
  - Take the UCSD/CITI IRB Human Subject Protection Training online course (Social and Behavioral Basic). It must be completed by week 3 of Fall, but it is good to do it earlier. Register at citiprogram.org and see this video for how to register: https://www.youtube.com/watch?v=hOAgfK93QXg
- Previous Project
Theoretical Computer Science
Barna Saha • barnas@ucsd.edu
TA: TBD
A19 4 seats Early morning on Friday
Reading theoretical papers related to fine-grained complexity and learning how to prove theorems. Fine-grained complexity seeks to explore the complexity of algorithms beyond the traditional coarse distinction between polynomial-time and NP-hard problems. The central idea is to investigate the relationships between different computational problems and identify those that are "hard" in a more fine-grained sense.
Read more
- About: Check here: https://en.wikipedia.org/wiki/Barna_Saha
- Mentoring Style: The students would need to be independent.
- Suggested Prerequisites:
- Summer Tasks: Read relevant papers that appeared in recent STOC/FOCS conferences.
- Previous Project
Hunting for Ghost Particles - Analyzing Time Series Data produced by Semiconductor Detectors
Aobo Li • aol002@ucsd.edu
TA: TBD
A20 4 seats Wednesday 3-4
Neutrinos are tiny particles that are almost like ghosts because they can pass through just about anything without being noticed. They're produced in huge numbers by the sun and other stars, but catching them is really tough because they hardly ever interact with other matter. Scientists use special, super sensitive equipment such as semiconductor detectors to try and spot these sneaky particles and learn more about how the universe works. The Majorana Demonstrator experiment utilizes an array of these semiconductor detectors to capture neutrinos hidden in the time series data generated by these detectors. In this project, we will establish an analysis team dedicated to examining this time series data. The team will undertake multiple analytical tasks, including employing machine learning models for time series classification and regression, aiming to produce an energy spectrum akin to the one generated by the Majorana Demonstrator.
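As a rough illustration of the kind of model a subgroup might start from, here is a PyTorch sketch that trains a small 1D CNN to classify synthetic detector-like pulses. The waveforms and labels here are simulated for illustration; the real Majorana Demonstrator data and tasks are described in the data release notes listed in the summer tasks.

```python
# 1D CNN sketch for binary classification of synthetic detector pulses.
import torch
import torch.nn as nn

torch.manual_seed(0)

def synthetic_pulse(two_steps: bool, length: int = 256) -> torch.Tensor:
    t = torch.arange(length, dtype=torch.float32)
    pulse = torch.sigmoid((t - 100) / 5.0)
    if two_steps:
        pulse = 0.6 * pulse + 0.4 * torch.sigmoid((t - 140) / 5.0)   # a second, delayed rise
    return pulse + 0.02 * torch.randn(length)

labels = torch.randint(0, 2, (512,))
waveforms = torch.stack([synthetic_pulse(bool(l)) for l in labels]).unsqueeze(1)  # (N, 1, 256)

model = nn.Sequential(
    nn.Conv1d(1, 8, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(4),
    nn.Conv1d(8, 16, kernel_size=7, padding=3), nn.ReLU(), nn.AdaptiveAvgPool1d(1),
    nn.Flatten(), nn.Linear(16, 2),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(20):
    opt.zero_grad()
    loss = loss_fn(model(waveforms), labels)
    loss.backward()
    opt.step()

acc = (model(waveforms).argmax(1) == labels).float().mean().item()
print(f"training accuracy: {acc:.2f}")
```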
Read more
- About: I am Aobo Li (you can call me obo, like the musical instrument). I am a new faculty member at HDSI & the Department of Physics. I earned my B.S. from UW Seattle and my PhD from Boston University, both in the field of physics. My research uses machine learning to squeeze out the maximum amount of information from ultra-sensitive radiation detectors, all in the quest to uncover extremely rare physics events in our universe.
- Mentoring Style: To achieve our final analysis goal—the detector spectrum—students will need to construct and train 3–5 machine learning models using a fully labeled dataset. One of these models will address a regression task, while the others will tackle binary classification, using 0/1 labels. An Analysis Coordinator (AC) will oversee the entire model-building process and document everything in a unified analysis document. Within the project, we will form subgroups; each will select a machine learning task, propose a model to accomplish it, and provide weekly updates during meetings to track progress. The AC and I will engage with each student weekly to discuss their tasks and provide feedback on their updates. Additionally, students will receive detailed assistance from the AC on coding and technical aspects, whereas I will focus on providing in-depth guidance to the AC.
- Suggested Prerequisites:
- Summer Tasks: The Majorana Demonstrator data we will analyze is already available online.
  - Data Download Website: https://zenodo.org/records/8257027
  - Data Release Notes: https://arxiv.org/pdf/2308.1085
  - All students who wish to get involved in this project should make sure to read the Data Release Notes carefully. Students should also try to download the data and make sure they can extract information from it (the data is stored in .hdf5 file format).
  - Machine Learning Prerequisite: Students should make sure they can design, run, and validate machine learning models for classification and regression tasks, ideally using PyTorch to build and train simple neural networks. During the data analysis process, students will have the freedom to pick their own models to use.
  - Analysis Coordinator: One of the enrolled students will be elected as the analysis coordinator (AC) of this project. The AC will take a leadership role to coordinate model development among the subgroups and manage the project at a higher level. This will be an excellent leadership experience that can be highlighted on a student's CV. If you are interested in this position, please send an email to aol002@ucsd.edu. If no one volunteers, the advisor will appoint one student as the AC. Please be prepared to serve as the AC if you enroll in this project.
  - Additional reading:
    - Nachman Undergraduate Thesis: https://drive.google.com/file/d/1oF8oiGke5SCVbKTbbPlNwxh9zYN_Nri4/view?usp=sharing (please pay special attention to Section 3: Pulse Shape Parameter Pipeline)
    - Majorana Demonstrator Experiment: https://phys.org/news/2023-02-legacy-majorana.html
- Previous Project
Blockchain
Rajesh K Gupta • rgupta@ucsd.edu
TA: TBD
A21 6 seats Wed 10-11 (Saturday 10AM is a standing alternative)
Blockchains provide a platform for developing new distributed programs and workflows that provide various services. They are particularly suited for services that involve asynchronous collaboration of diverse actors (humans or agents) to achieve overall system objectives. Among the key capabilities are verifiability and non-volatility/immutability of transactions, as well as enforcement of various dependencies in a provably correct manner. In this capstone project, you will explore one such service, then design and implement it using smart contracts on a chosen platform (Solidity/Ethereum, Solana, Hyperledger, etc.). You may also consider building upon past projects such as those for GymCoin, RealEstate, etc.
Read more
- About: Rajesh Gupta serves as a founding director of the Halıcıoğlu Data Science Institute and as a distinguished professor of Computer Science and Engineering at UC San Diego. His research is in embedded and cyber-physical systems with a focus on sensor data organization and its use in optimization and analytics. Prof. Gupta holds Qualcomm Endowed Chair in Embedded Microsystems at UC San Diego and INRIA International Chair at the French international research institute in Rennes, Bretagne Atlantique. He is a Fellow of the IEEE, the ACM and the American Association for the Advancement of Science (AAAS).
- Mentoring Style: My mentoring style is to listen to your progress and plans on a weekly basis and lead you to think through alternatives.
- Suggested Prerequisites:
- Summer Tasks: Please study and review the basics of blockchain. Since the smart-contract programming ecosystem is evolving, please research and practice with potential development platforms for your project. You may look at past projects for suggestions.
- Previous Project
Hardware Acceleration of ML Algorithms
Rajesh K Gupta • rgupta@ucsd.edu
TA: TBD
A22 6 seats Wed 11-12 (Sat 11-12 is an option as well)
Machine learning acceleration using hardware such as FPGAs refers to the design and implementation of hardware blocks that are useful either in accelerating application code (such as manipulation of graph neural networks) or in accelerating architectural mechanisms (such as prefetchers, memory assists, etc.). In this project you will explore the use of architectural mechanisms that substantially speed up selected ML codes.
Read more
- About: Rajesh Gupta serves as a founding director of the Halıcıoğlu Data Science Institute and as a distinguished professor of Computer Science and Engineering at UC San Diego. His research is in embedded and cyber-physical systems with a focus on sensor data organization and its use in optimization and analytics. Prof. Gupta holds Qualcomm Endowed Chair in Embedded Microsystems at UC San Diego and INRIA International Chair at the French international research institute in Rennes, Bretagne Atlantique. He is a Fellow of the IEEE, the ACM and the American Association for the Advancement of Science (AAAS).
- Mentoring Style: My mentoring style is to listen to your progress and plans on a weekly basis and lead you to think through alternatives.
- Suggested Prerequisites:
- Summer Tasks: Directed exploration, installation and use of necessary simulation tools for basic microarchitectures.
- Previous Project
Training Baby Language Models from Scratch
Alex Warstadt • awarstadt@ucsd.edu
TA: TBD
A23 6 seats Wednesday afternoon (time tentative)
Large Language Models have an impressive ability to learn and use human language, but humans are still the state of the art when it comes to learning language efficiently. We acquire language from 100 million words or fewer, whereas LLMs are now trained on tens of *trillions* of words. The BabyLM Challenge (https://babylm.github.io/) is a competition centered around training small "BabyLMs" under constraints inspired by human language learning. The goal of a BabyLM submission is to train a model that learns language as data-efficiently as a human or that simulates properties of human learning and linguistic performance.
Read more
- About: Alex Warstadt is an Assistant Professor at UCSD with joint appointments in HDSI and the Department of Linguistics. He received his PhD in linguistics in 2022 from NYU under Samuel Bowman, and completed a postdoc in 2024 with Ryan Cotterell. Alex runs UCSD's Learning Meaning and Natural Language Lab (LeMN Lab) which is an interdisciplinary group that uses insights from linguistics to advance and interpret language models and uses advances in machine learning to answer scientific questions about the nature of language.
- Mentoring Style: This topic lends itself to multiple parallel projects, but a single large project is acceptable as well depending on group size. PhD students in LeMN Lab will be encouraged to co-supervise projects.
- Suggested Prerequisites:
- Summer Tasks: Replicate prior BabyLM studies, including training LMs from scratch and evaluating them, launching jobs on a UCSD supercomputer cluster such as Nautilus, and tracking training progress and sweeps over several days or weeks using Weights & Biases. (A minimal training-and-logging sketch appears after this listing.)
- Previous Project
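As referenced in the summer tasks above, the sketch below shows the bare workflow of training a language model from scratch while logging to Weights & Biases. It uses a tiny character-level LSTM on a hypothetical local text file purely to illustrate the loop; real BabyLM submissions train word-level transformer LMs on the released corpora, and wandb requires an account and login.

```python
# Minimal sketch: train a tiny character-level LM from scratch and log the
# loss to Weights & Biases. The corpus file and project name are placeholders.
import torch
import torch.nn as nn
import wandb

text = open("babylm_sample.txt").read()               # hypothetical corpus file
vocab = sorted(set(text))
stoi = {c: i for i, c in enumerate(vocab)}
data = torch.tensor([stoi[c] for c in text])

class TinyLM(nn.Module):
    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)
    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)                             # next-character logits

model = TinyLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
wandb.init(project="babylm-warmup")                    # hypothetical project name

for step in range(1000):
    # sample 32 random windows of 129 characters; predict each next character
    starts = torch.randint(0, len(data) - 129, (32,))
    batch = torch.stack([data[j:j + 129] for j in starts.tolist()])
    x, y = batch[:, :-1], batch[:, 1:]
    loss = nn.functional.cross_entropy(model(x).transpose(1, 2), y)
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 100 == 0:
        wandb.log({"step": step, "loss": loss.item()})
wandb.finish()
```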
HCI and Reinforcement Learning for Hearing Loss Compensation
Harinath Garudadri • hgarudadri@ucsd.edu, hari@nadiworks.com
TA: TBD
A24 6 seats 4 to 5 PM, weekdays and weekends
HCI: Improve user interfaces for the hearing impaired using GUIs, voice user interfaces, and gesture-based interfaces. Reinforcement learning: balancing exploration and exploitation, constrained by the available (but growing) feedback data from users. (A small exploration/exploitation sketch appears at the end of this listing.)
Read more
- About: I run a lab at UCSD that conducts fundamental and translational research at the intersection of Technology, Healthcare, and Education -- THELab. I am the Founder/CEO of Nadi Inc, a California C-Corp that commercializes translational research from academia. We are launching "breakthrough" hearing aids using a SaaS model, in time for the 2025 holiday season. My interest in volunteering to mentor a DSC capstone is to solve the hard problems in the area of hearing aids and restore our ability to communicate in natural language.
- Mentoring Style: I am assuming capstones at HDSI may be different from the ones in ECE. I will have weekly presentations by a selected student, and these will count toward the grade.
- Suggested Prerequisites:
- Summer Tasks:
  - Thinking in Systems (book): https://a.co/d/9bXIFSq
  - PyTorch
  - NLTK
- Previous Project
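The reinforcement-learning half of this listing is about balancing exploration and exploitation from limited but growing user feedback. The sketch below is a minimal epsilon-greedy bandit over a few hypothetical hearing-aid presets with simulated thumbs-up/down feedback; it is only meant to make the trade-off concrete, not to represent the project's actual formulation.

```python
# Minimal sketch of the exploration/exploitation trade-off: an epsilon-greedy
# bandit choosing among a few hypothetical hearing-aid gain presets, using
# simulated user feedback. Real user feedback would replace the reward model.
import numpy as np

rng = np.random.default_rng(0)
true_pref = np.array([0.3, 0.6, 0.8, 0.5])    # hidden preference per preset (made up)
estimates = np.zeros(4)                        # running estimate per preset
counts = np.zeros(4)
eps = 0.1                                      # fraction of time spent exploring

for t in range(2000):
    if rng.random() < eps:
        arm = int(rng.integers(4))             # explore: try a random preset
    else:
        arm = int(np.argmax(estimates))        # exploit: use current best preset
    reward = rng.random() < true_pref[arm]     # simulated thumbs-up / thumbs-down
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print("estimated preferences:", np.round(estimates, 2))
```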
Scrolling sound and talking hand
Victor Minces, Virginia de Sa • vminces@ucsd.edu
TA: TBD
A25 6 seats Any day after 10am, preferably Tuesday to Thursday
My team has been developing an audio listening interface that can transform the way people consume podcasts, audiobooks, and videos. People are increasingly acquiring, creating, and sharing knowledge through audio and video, rather than reading. A problem with audio information, as opposed to reading, is that it is very difficult to 'scroll'. For example, if you space out while listening to a podcast, it can be cumbersome to find the last moment you were paying attention. It can also be difficult to skim through audio, dynamically changing the playback speed, forwards and back, to find the information you need. We are looking to solve this problem by creating a more organic listening interface. The task is creative, the possibilities endless, and there is a lot to learn. Another direction of my team is a 'talking hand', which transforms hand movements into speech. It can also sing! https://talkinghandminces.netlify.app/
Read more
- About: Victor Minces studied fine arts and physics at the University of Buenos Aires, and obtained his Ph.D. in computational neuroscience at UCSD. He researched how the brain represents sensory stimuli, including light and rhythmic sound, and the cognitive basis of musical rhythm. He created a program making widely used web applications and hands-on activities for people to play, experiment, and learn about sound. Besides his science and technology endeavors, he is a sound artist and performer, leading audiences of hundreds of people to explore and create sound together. Recently, his team has been designing original sonic interfaces for people to create sound and music, listen to audiobooks, and communicate. HDSI's Associate Director de Sa is a leader in the fields of cognitive science, neuroscience, computer science, engineering, and data science. Her research utilizes multiple approaches to increase our understanding of how humans and machines learn to perceive the world around them. She earned her Ph.D. and master’s in Computer Science from the University of Rochester, and a bachelor’s degree in Mathematics and Engineering from Canada’s Queen’s University.
- Mentoring Style: My students typically work independently, but I am very supportive. I meet with them once a week, but I am open (and happy) to follow up more often. I have experienced undergrads working with me, who can help guide new students. Dr. De Sa, and perhaps her students, will also be involved in mentoring, specifically as it refers to machine learning.
- Suggested Prerequisites:
- Summer Tasks:
  - Research algorithms to slow down and speed up sound, such as vocoders and granular synthesis. Reproduce one if possible (a naive granular-synthesis sketch appears after this listing).
  - Learn about frequency decomposition; spend some time playing with https://spectrogram.sciencemusic.org/
  - If you are interested in the talking hand, watch and carefully study this video: https://youtu.be/aFnWSBKImQU?si=hwO5GX55HA3oV6KD
  - And this video: https://youtu.be/fxYCk1t5DKE?si=Lrn7c_JzvD9ueTB2
  - And if you can, watch the whole series.
  - Reach out to me and I can assign you some tasks depending on your current knowledge and interests.
- Previous Project
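As mentioned in the summer tasks above, one algorithm worth reproducing is granular-synthesis time stretching. The sketch below is a naive NumPy overlap-add version applied to a synthetic test tone; it is a starting point under simple assumptions (mono audio, fixed grain and hop sizes), not a production vocoder.

```python
# Minimal sketch of granular-synthesis time stretching: overlap-add short
# Hann-windowed grains taken at a stretched input hop. A naive baseline to
# try before phase vocoders; assumes a mono float signal at sample rate sr.
import numpy as np

def granular_stretch(x, rate, grain=2048, hop=512):
    """rate < 1 slows the audio down, rate > 1 speeds it up."""
    window = np.hanning(grain)
    out_len = int(len(x) / rate) + grain
    out = np.zeros(out_len)
    norm = np.zeros(out_len)
    out_pos = 0
    while True:
        in_pos = int(out_pos * rate)           # read position moves at `rate`
        if in_pos + grain > len(x):
            break
        out[out_pos:out_pos + grain] += x[in_pos:in_pos + grain] * window
        norm[out_pos:out_pos + grain] += window
        out_pos += hop                         # write position moves at 1x
    return out / np.maximum(norm, 1e-8)        # undo window overlap gain

sr = 22050
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)               # 1-second test tone
slow = granular_stretch(x, rate=0.5)           # roughly 2 seconds of audio
print(len(x), len(slow))
```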
Trustworthy Machine Learning
Lily Weng • lweng@ucsd.edu
TA: TBD
A26 8 seats maybe Mon or Fri afternoon
This capstone is expected to be a fully research-oriented project led by student teams. Topics include building interpretable deep learning models and automated mechanistic interpretability frameworks for deep vision models or large language models. Another potential topic is jailbreak attacks and other failure modes of LLMs.
Read more
- About: Lily Weng is an Assistant Professor in the Halıcıoğlu Data Science Institute at UC San Diego with affiliation in the CSE department. Her research interest is in machine learning and deep learning, with primary focus on Trustworthy AI.
- Mentoring Style: Students will form teams to conduct research projects. Student teams are expected to lead the research project. Students are expected to be proficient in PyTorch and deep learning libraries, and have experience in setting up deep vision models and open-sourced LLMs.
- Suggested Prerequisites:
- Summer Tasks: For students to be successful in this section, they need to be able to understand the following papers and set up the code repositories successfully (a small activation-inspection sketch appears after this listing):
  1. Concept Bottleneck Large Language Models: https://arxiv.org/pdf/2412.07992
  2. Interpretable Generative Models through Post-hoc Concept Bottlenecks: https://arxiv.org/abs/2503.19377
  3. Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities: https://arxiv.org/pdf/2410.18469
- Previous Project
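As a warm-up for the papers listed above, the sketch below shows one of the basic moves in mechanistic-interpretability work: registering a forward hook on a vision model to capture intermediate activations. It uses an untrained torchvision ResNet-18 and a random input so nothing needs to be downloaded; the layer choice is arbitrary.

```python
# Minimal sketch of inspecting intermediate activations in a vision model with
# forward hooks -- a common first step in mechanistic-interpretability work.
# Uses an untrained ResNet-18 and a dummy input so nothing is downloaded.
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()
activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()    # stash the layer's output tensor
    return hook

model.layer3.register_forward_hook(save_activation("layer3"))

with torch.no_grad():
    _ = model(torch.randn(1, 3, 224, 224))     # dummy image batch

print(activations["layer3"].shape)             # e.g. torch.Size([1, 256, 14, 14])
```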
Wildfire and Property Intelligence Modeling with Cotality
Ilyes Meftah, Lawrence Vulis, Peter Nagy • imeftah@cotality.com, lvulis@cotality.com, pnagy@cotality.com
TA: TBD
A27 12 seats Preferably Thursday or Friday mornings
This capstone will offer students the opportunity to select from three rich domains of applied data science in collaboration with Cotality:
- **Domain 1 — Wildfire Risk Intelligence: Advanced Modeling for Climate Resilience.** Students will enhance wildfire catastrophe models by designing machine learning models to simulate fire intensity within wildfire perimeters. Focus is on improving hazard modeling to support insurance pricing, emergency planning, and resilience.
- **Domain 2 — Wildfire Risk Intelligence: Statistical Modeling of Wildfire Frequency.** This project explores methods for assigning realistic event frequencies to wildfire footprints, matching historical damage patterns using statistical modeling and machine learning. It will provide exposure to risk quantification, spatial data processing, and policy-relevant analytics.
- **Domain 3 — Property Intelligence: Enhancing Geospatial Data Quality for Risk Assessment.** Students will apply machine learning to improve land use classification across county boundaries and enhance data quality in nationwide parcel-level property databases. The goal is to refine features used across catastrophe models and climate analytics platforms.
Across all domains, students will engage in hands-on work with real datasets, industry tools (Python/R, GIS), and catastrophe modeling techniques. The outputs are intended to directly improve Cotality’s modeling platforms and have measurable real-world impacts.
Read more
- About: Ilyes Meftah has been a data scientist and catastrophe modeler with Cotality for 13 years. With a strong background in mathematics and quantitative finance (holding multiple master's degrees from universities in Paris, France), Ilyes has developed risk assessment models for wildfires, hurricanes, and earthquakes throughout his career. Recently, he has been focusing his efforts on quantifying wildfire mitigation measures to help communities located in high-risk areas. He is passionate about solving complex problems and sharing knowledge with others. When not working on catastrophe models, he enjoys hiking around the world with his family. Lawrence Vulis is a senior hazard scientist at Cotality, where he works on building physical and AI-based models of natural hazard risk to properties. Prior projects include machine learning-based classification of river delta geometry, satellite-based tracking of Arctic lake spatiotemporal dynamics, linking satellite-derived beach dynamics with offshore wave climate in Southern California, and a geospatial database/platform for machine learning-based permafrost mapping. His educational background is in Civil and Environmental Engineering, with a B.E. from The City College of New York and a Ph.D. from UC Irvine, with an extended internship and brief stint at Los Alamos National Lab. Outside of work he enjoys spending time with his wife and dog on beaches and trails. Peter Nagy has been with Cotality for 15 years, applying his passion toward big spatial data problems that occur with parcels, buildings, and geographic data relating to natural hazard risks. Prior experience includes the Virtual Earth (Streetside) team at Microsoft, as well as multiple projects with Vexcel including SRTM processing, feature extraction from radar imagery, visualizations of raster and vector imagery like polarimetric SAR compositions, and building the RAMS Antarctic DEM. He studied at the University of Colorado in Boulder, where he still lives, enjoying outdoor activities like hiking and skiing.
- Mentoring Style: Our mentoring approach combines structure with creativity in a collaborative environment. Weekly sessions will balance technical guidance with hands-on problem-solving. Students will have opportunities to interact with multiple catastrophe modeling experts at Cotality, gaining exposure to different perspectives and specialized knowledge. We believe learning works best when it's engaging and enjoyable, so we'll incorporate real-world applications and team-based challenges throughout the project. While we'll provide regular guidance and feedback, we value student initiative and will encourage independent exploration of solutions within our project framework. Our goal is to create an experience that's both intellectually stimulating and professionally valuable.
- Suggested Prerequisites:
- Summer Tasks: Suggested preparation (a short GeoPandas sketch appears after this listing):
  - Spatial Cluster Analysis (YouTube playlist): https://www.youtube.com/playlist?list=PLzREt6r1Nenk3L0ndufhYuwdrrfZqdsIA
  - Spatial Data Science General Topics (YouTube playlist): https://www.youtube.com/playlist?list=PLzREt6r1NenmFyTw8v2JZpEE4PZGNi5Ht
  - Python GIS Textbook (Parts II and III): https://pythongis.org/part2/index.html
  - R users: get comfortable with the terra & sf libraries and spatial point pattern analysis
  More detailed resources will be shared as needed.
- Previous Project
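To complement the GIS resources above, here is a minimal GeoPandas sketch of the kind of parcel-level workflow Domain 3 involves: reading two vector layers, aligning their coordinate reference systems, and spatially joining parcels to wildfire perimeters. The file names are placeholders; Cotality's actual data and schemas will differ.

```python
# Minimal sketch of a parcel-level geospatial workflow with GeoPandas:
# read two layers, align their coordinate reference systems, and spatially
# join parcels to wildfire perimeters. File names are placeholders.
import geopandas as gpd

parcels = gpd.read_file("parcels.geojson")             # hypothetical parcel layer
perimeters = gpd.read_file("fire_perimeters.geojson")  # hypothetical fire layer

parcels = parcels.to_crs(perimeters.crs)               # match projections first

# Each parcel picks up the attributes of any fire perimeter it intersects.
joined = gpd.sjoin(parcels, perimeters, how="left", predicate="intersects")
print(joined.head())
```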
Comparing Maps from Human Brain Imaging
Armin Schwartzman • armins@ucsd.edu
TA: TBD
A28 8 seats Wednesday 3:30-4:30PM (with some flexibility around this)
The organization of the human brain is incredibly complex. Over the past decades in neuroscience, researchers have measured the human brain with increasing spatial and temporal resolution, using imaging, recording, tracing, and sequencing technologies. This has yielded detailed maps of the spatial structure (e.g., cortical thickness, receptor densities, gene expression) and functional properties (e.g., task activations, functional connectivity) of the brain. Comparing these brain maps is fundamental to understanding the complexities of brain organization, but it presents unique challenges. Measures of spatial association between brain maps are strongly influenced by spatial autocorrelation, leading to inflated false positives if this is not properly accounted for. This project explores methods in spatial statistics for testing the association between brain maps, enabling researchers to more accurately interpret map-to-map similarities and differences. We will learn how to: (1) work with neuroimaging data in Python, (2) understand and implement existing methods for comparing maps while accounting for spatial autocorrelation, and (3) apply these methods to open-access datasets.
Read more
- About: With an undergraduate degree in electrical engineering, I discovered statistics for my PhD and have been doing data science since then (even when it wasn't called by that name). Much of my work involves signal and image analysis, but I'm interested in many theoretical and applied problems, even philosophical. Outside of academia, I like doing music, dancing, swimming, surfing, and more.
- Mentoring Style: Mentoring will involve data science PhD student Gabriel Riegner. Students are expected to take ownership over the project. This implies taking initiative in learning about the topic (from the assigned material and other sources), implementing the methods in code, being resourceful when needing help, and asking questions. Students are expected to put in their best effort, plan their time over the quarter, make substantial progress each week, report on it each week, and come up with an action plan for the next steps (as opposed to waiting for the mentor to give instructions). In other words, be independent and ask for help when needed.
- Suggested Prerequisites:
- Summer Tasks: Reading material (a naive permutation-test sketch appears after this listing):
  - Comparing spatial null models for brain maps, Markello and Misic 2021 (paper: https://pubmed.ncbi.nlm.nih.gov/33857618/, code: https://markello-spatialnulls.netlify.app/#)
  - Neuromaps: structural and functional interpretation of brain maps, Markello et al. 2022 (paper: https://pubmed.ncbi.nlm.nih.gov/36203018/, code: https://netneurolab.github.io/neuromaps/)
- Previous Project
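As flagged in the readings above, the central statistical issue is that spatial autocorrelation inflates false positives. The sketch below runs a naive (non-spatial) permutation test of the correlation between two stand-in maps; with real, spatially smooth brain maps this test is anti-conservative, which is exactly what spin tests and other spatial null models are designed to address.

```python
# Minimal sketch of a naive permutation test for the correlation between two
# "maps". With spatially autocorrelated brain maps this null is too easy to
# beat, motivating spin tests and other spatial null models from the readings.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
map_a = rng.normal(size=n)                     # stand-ins for vertex-wise brain maps
map_b = rng.normal(size=n)

r_obs = np.corrcoef(map_a, map_b)[0, 1]
null = np.array([np.corrcoef(map_a, rng.permutation(map_b))[0, 1]
                 for _ in range(5000)])        # shuffle one map, recompute r
p = (np.abs(null) >= abs(r_obs)).mean()        # two-sided permutation p-value
print(f"r = {r_obs:.3f}, naive permutation p = {p:.3f}")
```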
Differentially private synthetic telemetry data with modern generative AI
Yu-Xiang Wang, Bijan Arbab • yuxiangw@ucsd.edu, barbab@ucsd.edu
TA: TBD
A29 8 seats Monday afternoon 4pm (tentative)
Differentially private synthetic data generation plays a crucial role in enabling data sharing and analysis while preserving individual privacy. By creating artificial datasets that statistically resemble real data without revealing any specific individual's information, this approach allows organizations to comply with privacy regulations such as GDPR and HIPAA. It fosters innovation and collaboration by making sensitive data accessible to researchers, developers, and analysts without exposing personal details. Furthermore, the use of differential privacy techniques ensures that the risk of re-identifying individuals remains mathematically bounded, providing strong, quantifiable privacy guarantees. This balance between utility and privacy is essential for advancing fields like healthcare, finance, and social science in a responsible and ethical manner. In this specific project, you will work with me and HDSI Industry Fellow Dr. Bijan Arbab to generate differentially private synthetic telemetry data with modern generative AI. (A minimal Laplace-mechanism sketch appears at the end of this listing.)
Read more
- About: I am a faculty member of the Halıcıoğlu Data Science Institute at UC San Diego, also affiliated with the CSE department. Broadly speaking, my students and I apply math and computing to (1) design faster, stronger, and more efficient ML algorithms with provable guarantees and (2) solve societal challenges (e.g., data privacy, abuse prevention) that emerge in the AI era. Our recent focus includes watermarking generative AI, making differential privacy practical, bridging offline and online RL, and developing a theory of adaptivity in deep learning.
- Mentoring Style: Guiding students' learning and research through weekly meetings; the students lead the projects.
- Suggested Prerequisites:
- Summer Tasks: Read about differential privacy (see the course I taught in Fall 2024 and the references therein https://cseweb.ucsd.edu/~yuxiangw/classes/DSC291-2024Fall/)
- Previous Project
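To make the privacy guarantee concrete, here is a minimal sketch of the Laplace mechanism, the simplest differentially private primitive: a count released with noise scaled to sensitivity divided by epsilon. The telemetry values below are synthetic, and the project itself concerns far richer generative-model-based synthesis, not just noisy counts.

```python
# Minimal sketch of the Laplace mechanism: release a count with Laplace noise
# of scale sensitivity / epsilon. Adding or removing one record changes a
# count by at most 1, so the sensitivity here is 1.
import numpy as np

rng = np.random.default_rng(0)

def dp_count(values, predicate, epsilon):
    """epsilon-DP count of records satisfying `predicate` (sensitivity = 1)."""
    true_count = sum(predicate(v) for v in values)
    return true_count + rng.laplace(scale=1.0 / epsilon)

telemetry = rng.exponential(scale=50, size=10_000)     # fake latency telemetry (ms)
print("private count of slow events:",
      dp_count(telemetry, lambda v: v > 100, epsilon=0.5))
```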
Agents of Change - Knowledge Graphs Revolutionizing Life Sciences 🧪🧬
Abed El-Husseini, Balaji Veeramani • aelhusseini@deloitte.com, baveeramani@deloitte.com
TA: TBD
A30 8 seats
Knowledge graphs are powerful tools for representing complex biological systems by modeling entities—such as genes, proteins, diseases, and drugs—and the relationships between them. At the same time, agentic systems are capable of autonomous reasoning and goal-directed action. Combining these approaches enables the development of systems that can navigate vast biomedical knowledge networks, supporting researchers in hypothesis generation and data exploration. This group is for students interested in this intersection, drawing on principles from artificial intelligence, knowledge representation, and the life sciences to create intelligent agents that can explore biomedical data and deliver meaningful insights.
Read more
- About: Abed is an Applied AI Manager at Deloitte Consulting, specializing in Generative AI applications. Passionate about teaching, he has served as a business case mentor and capstone instructor for HDSI. A proud graduate of The Ohio State University, Abed now lives in Austin, Texas—the live music capital of the world—with his wife and son 🤠🎸. He's an avid runner and a dessert enthusiast, in that order. Balaji Veeramani is a specialist leader within Deloitte Consulting, helping organizations develop and adopt AI solutions responsibly. Balaji has been leading AI/ML teams developing deep learning, machine learning, data science, and GenAI-based solutions for life sciences, healthcare, diagnostics, agriculture, investment management, and logistics organizations. Balaji received his Ph.D. in Biomedical Engineering from Johns Hopkins University and a Master's in Electrical Engineering (signal processing) from Arizona State University.
- Mentoring Style: casual, fun, engaging
- Suggested Prerequisites:
- Summer Tasks (a toy knowledge-graph sketch appears after this listing):
  - Knowledge Graphs: https://arxiv.org/pdf/2003.02320
  - Biomedical Knowledge Graph: A Survey of Domains, Tasks, and Real-World Applications: https://arxiv.org/pdf/2501.11632
  - AI Engineering (on building and evaluating agentic solutions): https://www.oreilly.com/library/view/ai-engineering/9781098166298/
- Previous Project
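As a companion to the readings above, the sketch below builds a toy biomedical knowledge graph in NetworkX with typed nodes and labeled edges, then walks one hop of the kind of traversal an agent might perform. The entities and relations are illustrative placeholders, not a curated biomedical resource.

```python
# Minimal sketch of a toy biomedical knowledge graph: typed nodes for a gene,
# a drug, and a disease, with labeled edges an agent could traverse when
# answering a question. All entities and relations here are illustrative.
import networkx as nx

kg = nx.MultiDiGraph()
kg.add_node("TP53", type="gene")
kg.add_node("breast cancer", type="disease")
kg.add_node("doxorubicin", type="drug")
kg.add_edge("TP53", "breast cancer", relation="associated_with")
kg.add_edge("doxorubicin", "breast cancer", relation="treats")

# A trivial "agent" step: find drugs that treat diseases linked to a gene.
gene = "TP53"
for _, disease, d in kg.out_edges(gene, data=True):
    if d["relation"] == "associated_with":
        drugs = [u for u, _, e in kg.in_edges(disease, data=True)
                 if e["relation"] == "treats"]
        print(f"{gene} -> {disease}: candidate drugs {drugs}")
```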
AI-Driven Jail Population Optimization System for San Diego County Sheriff's Office
Ashish Kakkad, Dave Blackwell, Christopher Lawrence • Christopher.Lawrence@sdsheriff.gov; David.Blackwell@sdsheriff.gov; Ashish.Kakkad@sdsheriff.gov
TA: TBD
A31 4 seats Fridays at 10am
SDSO struggles with underutilized bed space caused by complex classification restrictions. The existing manual or semi-automated decision-making process limits the Jail Population Management Unit's (JPMU) ability to respond quickly and accurately to changes in jail population demographics and availability. A more intelligent, automated system is required to streamline this process and improve overall efficiency.
Read more
- About: Christopher Lawrence began his service with the San Diego County Sheriff's Office as a Deputy Explorer in 1999 and officially joined as a Deputy Sheriff in 2005, serving in various assignments including the Human Trafficking Task Force and Criminal Intelligence Detail. He was promoted to Sergeant in 2016, then to Lieutenant in 2021, taking on leadership roles in multiple units including the Threat Assessment Group and Communications Center. In 2023, he became Captain of the North Coastal Station and later led the Major Crimes Division. By November 2024, he was promoted to Commander, overseeing all seven of the Sheriff's detention facilities. Christopher holds a bachelor's degree in Criminal Justice Management and is a graduate of several advanced leadership programs; he is also a husband and father of two daughters and a son. Ashish "Yosh" Kakkad is the Chief Technology Officer (CTO) for the San Diego County Sheriff's Office, a position he has held since 2013 after joining the department in 2002. As CTO, he oversees the technology budget, contracts, and long-term technology strategy, ensuring alignment with the department's strategic goals. He also manages the Wireless Services Division, which supports over 116 agencies and more than 20,000 radios across San Diego and Imperial counties. Prior to his current role, Yosh led the development of a regional data-sharing platform used by over 70 agencies and 12,000 users. A first-generation Indian immigrant and U.S. Air Force veteran, he holds a Bachelor's in Computer Science, an MBA in IT Management, and is a devoted husband and father of three. David Blackwell is a Lieutenant with the San Diego County Sheriff's Office. He leads the Operational Technology Unit within the Technology Services Division. In his role as a leader of OTU, Lt. Blackwell is responsible for aligning and supporting the Office's technology roadmap with operational priorities, as well as the implementation and rollout of any tech that impacts operations. Lt. Blackwell has been with the Sheriff's Office for over 20 years and has extensive operational knowledge, and he has developed several critical enterprise applications.
- Mentoring Style: The students will work closely as a team with the mentor and domain experts as needed. They will have access to the resources necessary for a successful capstone.
- Suggested Prerequisites:
- Summer Tasks: https://lao.ca.gov/reports/2019/4023/inmate-classification-050219.pdf
- Previous Project
Explorations on in-context learning in LLMs
Prof. Arya Mazumdar • amazumdar@ucsd.edu
TA: TBD
A32 8 seats Tuesdays at 1pm
In-context learning is a distinctive capability of large language models (LLMs) that enables them to perform tasks by leveraging examples or instructions provided directly within the input prompt, without any parameter updates or explicit fine-tuning. This means the model can adapt its behavior based on the context given in a single interaction, such as demonstrating a task through a few examples (few-shot learning) or simply providing a clear instruction (zero-shot learning); a small prompt-construction sketch appears at the end of this listing. In-context learning allows LLMs to generalize across tasks and domains, making them highly flexible and efficient tools for a wide range of applications, from language translation to code generation. This approach contrasts with traditional machine learning paradigms that require model retraining or fine-tuning to incorporate new information. In-context learning raises several fascinating theoretical questions, such as:
1) Mechanism of Learning Without Weight Updates: How do LLMs perform complex reasoning or task adaptation based purely on input text, without changing internal weights? What internal representations support this kind of "learning"?
2) Implicit vs. Explicit Learning: To what extent does in-context learning mirror classical learning paradigms like gradient descent? Are LLMs simulating optimization procedures internally when given examples?
3) Limits of Generalization: What are the boundaries of what in-context learning can achieve? For instance, how well can models generalize to truly novel tasks or concepts they’ve never seen, even as examples?
Read more
- About: Arya Mazumdar is a Professor at the University of California, San Diego. His research interests include coding theory, information theory, statistical learning, and distributed optimization. He is a recipient of multiple awards, including the Distinguished Dissertation Award for his Ph.D. thesis in 2011, the NSF CAREER Award in 2015, the EURASIP Best Paper Award in 2020, and the IEEE ISIT Jack K. Wolf Student Paper Award in 2010. He was a Distinguished Lecturer of the Information Theory Society for 2023-24, and currently serves as an Associate Editor for IEEE Transactions on Information Theory, an Area Editor for the Now Publishers Foundations and Trends in Communications and Information Theory series, and an Action Editor of Transactions on Machine Learning Research.
- Mentoring Style: Every week, students must report progress. Every individual should clearly articulate what they did that week.
- Suggested Prerequisites:
- Summer Tasks: Read about the topics above.
- Previous Project
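The sketch referenced in the description above simply constructs zero-shot and few-shot prompts as strings, which is all in-context learning requires at inference time: no weights are updated, only the prompt changes. The sentiment task and example reviews are made up for illustration.

```python
# Minimal sketch of zero-shot vs. few-shot prompting: both are plain strings
# handed to an LLM, with no weight updates involved. The task is illustrative.
examples = [
    ("The plot dragged and the acting was flat.", "negative"),
    ("A warm, funny, beautifully shot film.", "positive"),
]

def zero_shot(query):
    # Instruction only, no demonstrations.
    return (f"Classify the sentiment of the review as positive or negative.\n"
            f"Review: {query}\nSentiment:")

def few_shot(query):
    # Same instruction, plus in-context demonstrations of the task.
    demos = "\n".join(f"Review: {x}\nSentiment: {y}" for x, y in examples)
    return (f"Classify the sentiment of each review.\n{demos}\n"
            f"Review: {query}\nSentiment:")

print(zero_shot("I would happily watch it again."))
print()
print(few_shot("I would happily watch it again."))
```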
Deep learning for understanding microbiome results
Prof. Rob Knight • rknight@ucsd.edu
TA: TBD
A33 8 seats TBA
This group is for students interested in exploring applications of deep learning to the microbiome space. The first few weeks of the course will be devoted to understanding what the microbiome is and past approaches to analyzing it, together with opportunities to apply deep learning techniques in various ways to analysis of the microbial DNA sequences, communities, and/or annotations. The specific results that will be re-analyzed and techniques used will be driven by student interests. For example, past capstone classes have focused on the microbiome as a source of health disparities in Hispanic and Latino populations, use of protein language models to identify antimicrobial peptides, and use of DNA language models to identify mutations associated with drug resistance or to analyze relationships between entire microbial communities and phenotype. Currently emerging opportunities include development of digital twin models of humans, spatial sequencing, and long-read metagenomics.
Read more
- About: While researching the origins of life as a postdoc in Boulder, Colorado, I developed algorithms for comparing RNA that turned out to be enabling technologies for the whole microbiome field. I moved to UCSD at the end of 2014 to lead the Center for Microbiome Innovation, and my lab focuses on developing new technologies to read out and understand complex microbial communities in the environment and the human body. I gave a TED talk that has been viewed > 2 million times and wrote two popular books on the microbiome. Outside academia, I like to travel, cook, hike, and paddleboard.
- Mentoring Style: The capstone group will run as a separate, focused entity, with relevant members of the lab (grad students, postdocs) providing additional perspective/co-mentorship. Attendance at lab meetings/code reviews for the whole lab is optional.
- Suggested Prerequisites:
- Summer Tasks: "https://www.ted.com/talks/rob_knight_how_our_microbes_make_us_who_we_are?language=en Familiarity with Python, Pandas and PyTorch are essential. Depending on project, TensorFlow, scikit-bio and STAN could be useful."
- Previous Project
From Data to Dispatch - Optimizing SDG&E Field Services
Phi Nguyen, Chuck Hahm, Fatemeh Aarabi • pnguyen@sdge.com, CHahm@sdge.com, FAarabi@sdge.com
TA: TBA
A34 12 seats TBD
This capstone offers students a unique opportunity to partner with San Diego Gas & Electric (SDG&E) to improve how field services—like metering, inspections, and emergency repairs—are delivered across the region. Students will work directly with field technicians and analysts to explore how data science can optimize truck dispatches, reduce operational costs, and enhance safety, all while improving the customer experience for San Diegans. Using real-world utility data, students may apply machine learning, geospatial analysis, and optimization techniques to solve challenges such as predicting equipment failures, streamlining technician routes, or identifying service anomalies. This project is ideal for students eager to connect data with real-world impact, gain experience in applied analytics, and contribute to a more efficient and customer-focused energy future.
Read more
- About: Dr. Nguyen graduated from UCSD with a Ph.D. in Materials Science and Engineering, where he developed nanomaterials for clean energy applications. He then worked for several years as a consultant in the energy sector, where his focus was on using data to support policies that promote clean energy and energy efficiency. Dr. Nguyen joined SDG&E to focus on developing models to mitigate wildfire risk in California and has since expanded his work to other areas that benefit San Diego communities. Charles ("Chuck") Hahm is a Data Scientist in SDGE's Customer Field Service organization. His data science experience spans a range of industries, including cybersecurity, customer analytics, sensor analytics, medical diagnostics, and image processing. He has served as adjunct faculty and course developer in National University's graduate analytics program. In the government sector, he has served as Principal Investigator for SBIR (Small Business Innovative Research) grants for the U.S. Navy, U.S. Air Force, and National Institutes of Health. Chuck holds a master's degree in electrical engineering from the Illinois Institute of Technology and a bachelor's degree from the University of Illinois at Chicago. Dr. Fatemeh Aarabi holds a PhD in operations research from State University of New York at Buffalo. During her doctoral studies, she focused on applied operations research methods like routing and scheduling algorithms with applications in urban systems. After graduation she joined industry to develop optimization frameworks for emergency management systems, working on optimization algorithms and ML predictive methods to reduce the EMS response time. In 2022 Fatemeh joined SDGE as a data scientist where she has focused on developing models to mitigate wildfire risk in California.
- Mentoring Style: The student group will be a stand-alone unit at SDG&E led by Mentors. Mentors will first work with students to understand utility space, and then schedule time with other SDG&E staff who will provide tours, field visits, and other utility-specific training. Students will also be introduced to other data scientists and engineers at SDG&E who are available for support on an as-needed basis throughout the duration of the project. However, once an introduction is made, it will be up to the students to reach out to staff when support is needed. Students will be encouraged to present their ideas to staff members beyond the mentors.
- Suggested Prerequisites:
- Summer Tasks: None
- Previous Project
Causal Copilot
Biwei Huang • bih007@ucsd.edu
TA: TBA
A35 8 seats Friday afternoon
This domain focuses on causal discovery—the process of learning cause-and-effect relationships from data—and the development of a Causal Copilot, an AI assistant that helps users automatically learn and interpret causal relationships. Students will explore methods to learn causal graphs from observational data, simulate interventions, and answer “what-if” questions across domains like health, economics, and science. The project combines causal inference, machine learning, and interactive AI systems.
Read more
- About: I am an assistant professor at HDSI. My research focus is causal discovery and causal AI.
- Mentoring Style: May involve my PhD students
- Suggested Prerequisites:
- Summer Tasks: Reading (a tiny conditional-independence-test sketch, the building block of the PC algorithm, appears after this listing):
  1. Causal-Copilot: An Autonomous Causal Analysis Agent. https://arxiv.org/pdf/2504.13263
  2. The PC algorithm in causal discovery
- Previous Project
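As a primer on the PC algorithm mentioned above, the sketch below performs the test the algorithm repeats many times: a Fisher z-test of (partial) correlation to decide conditional independence. The data are simulated from a chain X -> Z -> Y, so X and Y look dependent marginally but independent once Z is conditioned on; real causal-discovery code (e.g., the open-source causal-learn library) wraps such tests in a full search over graphs.

```python
# Minimal sketch of the conditional-independence test at the heart of the PC
# algorithm: a Fisher z-test on the (partial) correlation of X and Y given Z.
# Data are simulated from the chain X -> Z -> Y.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=n)
Z = X + rng.normal(size=n)
Y = Z + rng.normal(size=n)

def fisher_z(x, y, z=None):
    """Two-sided p-value for (partial) correlation of x and y (given z)."""
    if z is None:
        r = np.corrcoef(x, y)[0, 1]
        dof = len(x) - 3                       # n - |S| - 3 with |S| = 0
    else:
        # partial correlation via residuals after regressing out z
        rx = x - np.polyval(np.polyfit(z, x, 1), z)
        ry = y - np.polyval(np.polyfit(z, y, 1), z)
        r = np.corrcoef(rx, ry)[0, 1]
        dof = len(x) - 4                       # n - |S| - 3 with |S| = 1
    stat = np.sqrt(dof) * np.arctanh(r)
    return 2 * (1 - stats.norm.cdf(abs(stat)))

print("p-value, X vs Y (marginal): ", fisher_z(X, Y))     # tiny: dependent
print("p-value, X vs Y given Z:    ", fisher_z(X, Y, Z))  # large: cond. independent
```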
Sepsis — Using Clinical Healthcare Data Science to Identify and Combat an Infectious Killer
Kyle Shannon • kshannon@ucsd.edu
TA: TBA
A36 4 seats Thursdays 1:30–2:30 PM
Sepsis is a life-threatening condition caused by the body’s extreme response to an infection. Early detection and intervention are crucial for improving patient outcomes. This project aims to develop a radiographic-enhanced clinical decision support system for early sepsis detection and risk assessment. The system leverages chest X-ray images and patient metadata to predict the probability of sepsis development within specific time frames after the X-ray is taken. The proposed pipeline consists of two main components: 1) a ResNet model that predicts lung anomalies based solely on chest X-ray images, and 2) a CatBoost model that combines the output of the first model with patient vitals and other relevant metadata to predict whether the patient is at risk of sepsis (a minimal two-stage sketch appears at the end of this listing). The project involves extensive data engineering to preprocess and integrate the MIMIC-IV and MIMIC-CXR datasets.
Read more
- About: Hi 👋 I’m Kyle Shannon. As a professional in the public health and data science fields, I am dedicated to improving healthcare accessibility and enhancing patient outcomes, particularly in rural America. My journey began at UCSD, where I studied in the CogSci department as an undergraduate and discovered my passion for Data Science when it was still an emerging field (2013). I later pursued my master's degree in Data Science at UCSD, and eventually co-founded a startup focused on healthcare access. My enthusiasm lies in data science projects that directly impact patient health outcomes, and I maintain a keen interest in cognitive neuroscience and tiny ML systems. Outside of work, you can find me on a tennis court or delighting in the ambiance of a cozy cafe while tackling projects.
- Mentoring Style: My goal is to create a capstone experience that emulates a practical "job" setting, guiding students in effectively interacting with managers and data science leads, asking relevant questions, and fulfilling their responsibilities. I may assume various roles (e.g., DS lead, stakeholder, hospital admin, manager) to offer diverse perspectives. I incorporate a business angle to discuss the project's broader context, encouraging students to envision their work in scenarios such as product development or hospital consultancy. This approach helps them grasp real-world applications and develop a compelling narrative for their projects. I prioritize accessibility for my students throughout the week, for example, via Discord, and may involve domain experts for them to interview and learn from professionals in ICUs and EHR data. This context adds valuable insight and humanizes the data/system. I often hold informal meetings with my students over coffee to discuss progress and answer questions. Occasionally, I expect them to provide progress reports and mini-presentations, simulating a real-world organizational experience.
- Suggested Prerequisites:
- Summer Tasks: The following are recommended summer domain readings and tasks. Getting through some or all of these, especially if you are a bit unfamiliar with the domain, would be a good idea and will help you hit the ground running in the fall. I will be available during the summer to meet with you as a group once or twice if you wish. For clarity, the three areas I recommend focusing on during the summer are:
  - Familiarizing yourself with EHR data
  - Learning about the MIMIC dataset
  - Beginning to understand a bit more about clinical critical care in an ICU
- Previous Project
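The sketch below mirrors the two-stage pipeline described above at toy scale: an (untrained) ResNet produces an image-level anomaly score that is concatenated with tabular vitals and fed to a CatBoost classifier. Everything here is synthetic stand-in data; the real project uses MIMIC-CXR images, MIMIC-IV metadata, and trained models.

```python
# Minimal sketch of the two-stage pipeline: an (untrained here) ResNet yields
# an image-level anomaly score, which is concatenated with tabular vitals and
# fed to a CatBoost classifier. All data below is synthetic.
import numpy as np
import torch
import torchvision.models as models
from catboost import CatBoostClassifier

resnet = models.resnet18(weights=None)
resnet.fc = torch.nn.Linear(resnet.fc.in_features, 1)   # lung-anomaly score head
resnet.eval()

rng = np.random.default_rng(0)
images = torch.randn(64, 3, 224, 224)                   # stand-in chest X-rays
with torch.no_grad():
    anomaly_score = torch.sigmoid(resnet(images)).numpy()

vitals = rng.normal(size=(64, 5))                        # stand-in heart rate, BP, ...
X = np.hstack([anomaly_score, vitals])                   # stage-1 output + metadata
y = rng.integers(0, 2, size=64)                          # stand-in sepsis labels

clf = CatBoostClassifier(iterations=50, verbose=False)
clf.fit(X, y)
print(clf.predict_proba(X)[:3])
```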
Quantifying the credibility of large language model outputs
Yian Ma • yianma@ucsd.edu
TA: TBA
A37 8 seats TBD
Accurate uncertainty quantification in large language models (LLMs) is essential for providing credible confidence estimates over their outputs. We will explore how to quantify the uncertainty and credibility of LLM outputs, focusing on prompt perturbation methods applied to open-source LLMs. (A small prompt-perturbation sketch appears at the end of this listing.)
Read more
- About: I am an assistant professor at the Halıcıoğlu Data Science Institute and an affiliated faculty member at the Computer Science and Engineering Department of the University of California San Diego. Prior to UCSD, I spent a year as a visiting faculty member at Google Research. Before that, I was a post-doctoral fellow at EECS, UC Berkeley. I completed my Ph.D. at the University of Washington and obtained my bachelor's degree at Shanghai Jiao Tong University.
- Mentoring Style:
- Suggested Prerequisites:
- Summer Tasks: Please read the following papers and prepare to lead a discussion on them at the beginning of the fall quarter: https://arxiv.org/pdf/2403.02509 https://arxiv.org/pdf/2406.02543 https://openreview.net/pdf?id=p6jsTidUkPx
- Previous Project
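As referenced in the description above, a simple way to probe credibility is to perturb the prompt and measure how much the answers move. The sketch below uses a placeholder generate() function (not a real API) so it runs on its own; in practice you would call an open-source LLM and could also sample multiple completions per prompt.

```python
# Minimal sketch of prompt-perturbation uncertainty: ask paraphrased versions
# of the same question, then use the spread of answers as a credibility signal.
# `generate` is a stand-in for a real open-source LLM call -- it is NOT a real API.
from collections import Counter
import math

def generate(prompt: str) -> str:
    """Placeholder for an LLM call; returns a canned answer for illustration."""
    return "Paris" if "capital" in prompt.lower() else "unsure"

perturbations = [
    "What is the capital of France?",
    "Name the capital city of France.",
    "France's capital is which city?",
]

answers = [generate(p) for p in perturbations]
counts = Counter(answers)
probs = [c / len(answers) for c in counts.values()]
entropy = -sum(p * math.log2(p) for p in probs)   # 0 bits = fully consistent answers

print(counts)                                      # agreement across perturbed prompts
print(f"answer entropy = {entropy:.2f} bits")
```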
Deep learning analysis and augmentation of medical images
Albert Hsiao • hsiao@ucsd.edu
TA: TBA
A38 10 seats Mondays at 11am
This group will develop hands-on skill sets for developing deep learning algorithms for medical imaging. In the first quarter, students will reproduce our lab's prior results in chest radiography, including the estimation of blood serum markers of heart failure from chest radiographs, using standard CNN architectures (a minimal image-regression sketch appears at the end of this listing). In the second quarter, students will investigate new approaches for interfacing specialized CNN image encoders with strategies that interrogate algorithm explainability.
Read more
- About: Albert Hsiao, MD, PhD is a cardiothoracic radiologist trained in engineering at Caltech and bioengineering and bioinformatics in the UC San Diego Medical Scientist Training Program (MSTP). He completed his residency and fellowships in Interventional Radiology and Cardiovascular Imaging at Stanford before returning to UC San Diego as faculty in Radiology, where he leads advanced cardiovascular imaging and the Augmented Imaging and Data Analytics (AiDA) research laboratory. While a radiology resident at Stanford, he co-founded Arterys, a cloud-native software company to bring 4D Flow MRI and artificial intelligence technologies to market. He continues to partner with industry to develop and create new imaging technologies to improve diagnosis and management of disease.
- Mentoring Style: Students will meet weekly with me and additionally interface with scientists, post-doctoral fellows and graduate students in the lab.
- Suggested Prerequisites:
- Summer Tasks: It may be worthwhile to review this paper, which will be a substantial component of the first quarter: https://ieeexplore.ieee.org/document/9768796
- Previous Project
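The sketch referenced in the description above frames "estimate a serum marker from a chest radiograph" as image regression: swap a ResNet's classification head for a single continuous output and train with mean-squared error. The images and marker values are random stand-ins, and the architecture choice is illustrative rather than the lab's actual setup.

```python
# Minimal sketch of image regression: replace a ResNet's classification head
# with a single continuous output and train with MSE. All data is synthetic.
import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 1)      # regression head (serum marker)

xrays = torch.randn(16, 3, 224, 224)               # stand-in radiographs
marker = torch.rand(16, 1) * 100                   # stand-in serum marker values

opt = torch.optim.Adam(model.parameters(), lr=1e-4)
for step in range(5):
    pred = model(xrays)
    loss = nn.functional.mse_loss(pred, marker)
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"step {step}: MSE = {loss.item():.1f}")
```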