Home page for the Course Web Page for the 2025-26 Edition

** Read the information at the top of the page, then scroll down to see information about each domain. **

Domain Descriptions

DSC Capstone, 2025-26 @ UC San Diego

Overview

Welcome to the capstone program! The capstone program is a two-quarter sequence (Fall 2025 and Winter 2026) in which you will be mentored by a faculty member or industry expert in their domain of expertise. By the end of Quarter 2, you will have designed and executed a project from that domain as part of a team. You can see the projects from last year at dsc-capstone.org/showcase-25.

At a high level, here’s how the capstone program is organized:

In Quarter 1 (DSC 180A), you gain background information in your mentor’s domain, by means of replicating a known result. By the end of Quarter 1, you will have completed a replication project (known as the “Quarter 1 Project”) and will have a proposal for a more independent project (known as the “Quarter 2 Project”, or the capstone project).
In Quarter 2 (DSC 180B), you execute the Quarter 2 Project you proposed at the end of Quarter 1.

Enrollment

Enrollment will begin soon. The available domains are not listed on the Schedule of Classes; instead, they are detailed below. Most domains are run by UCSD faculty, but some are run by industry partners (denoted with an Industry Partner badge).

Use the information here to choose the domain you’d like to enroll in. Once you’ve chosen a domain, all you need to do is enroll in the corresponding discussion section for DSC 180A once registration is open, space permitting. Note that you cannot change domains between DSC 180A and DSC 180B.

All of the information here – domain offerings, section times, descriptions, summer tasks, etc. – is subject to change as mentors provide us with more information.

How should I choose a domain?

You should aim to choose a domain that suits your interests and preparation. By clicking the Read more button underneath a domain, you’ll get to learn more about the mentor, their mentoring style, the prerequisites that they’d like their students to have, tasks that they’d like their students to work on over the summer, and their students’ capstone projects in previous years (if any).

✅ Good reasons to choose a domain:

You are sufficiently prepared.
You find the topic interesting and would be motivated to study it.
You are likely to work well with the mentor given their mentoring style and research background.

❌ Bad reasons to choose a domain:

The mentor has good teaching evaluations.
The industry partner is a big company.
The section has space for you and your friends at a time that you like.

Everything you produce for the capstone will have to be public on the internet for the rest of eternity with your name and your mentor’s name attached to it – you want your capstone work to be something that you’re proud of and can talk about on job and graduate school applications. Who do you want writing you a recommendation letter?

What happens in DSC 180A?

In addition to meeting with your mentor each week, there will also be methodology instruction delivered by the capstone coordinator and the methodology course staff. However, the majority of this instruction will occur asynchronously, in the form of readings (like this one). This means that you can mostly ignore the lecture and lab times that appear for DSC 180A on the Schedule of Classes. A few of the lecture slots may be used for the capstone coordinator’s office hours or for one-off guest lectures, but we don’t plan to use the majority of the times.

All prerequisites for DSC 180A will be strictly enforced. The prerequisites for DSC 180A can be found here.

Note that since DSC 180A and DSC 180B are both 4-unit courses, you should expect to spend 12 hours a week on capstone-related work each quarter. Plan your class schedule accordingly – try not to take several time-consuming classes alongside the capstone.

Who is overseeing the capstone?

With any questions about the capstone sequence itself, feel free to email Umesh Bellur (ubellur@ucsd.edu) for now.

With any questions about the content of a particular domain, contact the mentor. With any questions about enrollment, please contact Student Affairs in the VAC.



AI/ML Systems


Data Valuation & Curation for Trustworthy AI
Babak Salimi
TA: TBA
A01 8 seats Friday 3:00 – 4:00 PM PT

Modern machine-learning pipelines increasingly live or die on data decisions—what to keep, what to toss, what to label next, and how much each record is “worth.” This domain surveys the growing toolkit of data-valuation and data-centric AI methods—Data Shapley (ICML 2019), influence functions, MMD/Wasserstein-based coresets, active-learning and subset-selection heuristics—that assign a quantitative score to every example’s marginal impact on model accuracy, robustness, and privacy risk. We’ll examine how those scores power practical tasks such as pruning noisy or duplicated records, budgeting scarce labeling effort, spotting harmful outliers, and auditing publicly released datasets. Students will dive into open benchmarks like WILDS (https://wilds.stanford.edu/) for distribution shifts and the DataPerf challenges (https://dataperf.org/) for curation leaderboards, replicate baseline valuation techniques in Quarter 1, and use the insights—and gaps—they uncover to propose original data-selection or curation projects for Quarter 2.
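To make the idea of scoring each example's marginal impact concrete, here is a minimal sketch of permutation-sampling Data Shapley. It is an illustration under stated assumptions, not the method used in this domain: the 1-D toy dataset, the 1-nearest-neighbour utility, and the names `utility` and `monte_carlo_shapley` are all hypothetical, and it uses plain Monte Carlo sampling without the truncation step described in the Data Shapley paper.

```python
import random

def utility(train, valid):
    """Validation accuracy of a 1-nearest-neighbour classifier on 1-D points."""
    if not train:
        return 0.0  # assumption: an empty training set scores zero
    correct = 0
    for x, y in valid:
        nearest = min(train, key=lambda p: abs(p[0] - x))
        correct += nearest[1] == y
    return correct / len(valid)

def monte_carlo_shapley(train, valid, n_permutations=200, seed=0):
    """Average each point's marginal gain in utility over random orderings."""
    rng = random.Random(seed)
    n = len(train)
    values = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        subset, prev = [], utility([], valid)
        for i in order:
            subset.append(train[i])
            cur = utility(subset, valid)
            values[i] += cur - prev  # marginal contribution of point i
            prev = cur
    return [v / n_permutations for v in values]

# Toy data (hypothetical): (feature, label); the last training point is mislabeled.
train = [(0.0, 0), (0.2, 0), (1.0, 1), (1.2, 1), (0.3, 1)]
valid = [(0.05, 0), (0.12, 0), (0.27, 0), (0.9, 1), (1.05, 1), (1.3, 1)]

values = monte_carlo_shapley(train, valid)
for point, value in zip(train, values):
    print(point, round(value, 3))
```

Within each permutation the marginal gains telescope, so the estimated values always sum to the utility of the full training set minus that of the empty set. This is a handy sanity check when replicating a valuation method.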


About: Babak Salimi is an Assistant Professor in the Halıcıoğlu Data Science Institute at UC San Diego. His research lies at the intersection of data management and machine learning, with a focus on building reliable, safe, and robust data-driven systems. He develops methods and tools that enhance the transparency and dependability of algorithmic decision-making, empowering practitioners to make informed and confident choices.
Mentoring Style: Capstone students will operate as an independent cohort: we will meet for a dedicated one-hour session each week where I lead mini-lectures, paper discussions, and milestone check-ins. While my PhD students are not formally part of the section, I may occasionally invite them to hold optional office-hour–style drop-ins for coding or tooling questions. I’ll be hands-on during project framing, data wrangling, and experimental design, then step back so teams can drive their own experiments and insights, checking progress and providing strategic guidance every week.
Suggested Prerequisites:
Summer Tasks:
Readings (to ground everyone in core data-valuation methods):
• “Data Shapley: Equitable Valuation of Data for Machine Learning” (ICML 2019)
• “LAVA: Data Valuation without Pre-Specified Learning Algorithms” (ICLR 2023)
• “Adversarial Attacks on Data Valuation” (NeurIPS 2024)
Hands-on (pick one): Clone the LAVA GitHub repo and run the quick-start script on a small UCI dataset, producing a ranked list of data values.
Previous Project

Hardware Acceleration of ML Algorithms
Rajesh K Gupta • rgupta@ucsd.edu
TA: TBD
A02 6 seats Wed 11-12 (Sat 11-12 is an option as well)

Machine learning acceleration using hardware such as FPGAs refers to the design and implementation of hardware blocks that accelerate either application code (such as manipulation of graph neural networks) or architectural mechanisms (such as prefetches and memory assists). In this project you will explore the use of architectural mechanisms that substantially speed up the selected ML code.


About: Rajesh Gupta serves as a founding director of the Halıcıoğlu Data Science Institute and as a distinguished professor of Computer Science and Engineering at UC San Diego. His research is in embedded and cyber-physical systems with a focus on sensor data organization and its use in optimization and analytics. Prof. Gupta holds the Qualcomm Endowed Chair in Embedded Microsystems at UC San Diego and the INRIA International Chair at the French international research institute in Rennes, Bretagne Atlantique. He is a Fellow of the IEEE, the ACM and the American Association for the Advancement of Science (AAAS).
Mentoring Style: My mentoring style is to listen to your progress and plans on a weekly basis and lead you to think through alternatives.
Suggested Prerequisites:
Summer Tasks: Directed exploration, installation and use of necessary simulation tools for basic microarchitectures.
Previous Project

Trustworthy Machine Learning
Lily Weng • lweng@ucsd.edu
TA: TBD
A03 8 seats Monday at 4pm

This capstone is expected to be a fully research-oriented project led by student teams. Topics include building interpretable deep learning models and automated mechanistic interpretability frameworks for deep vision models or large language models. Another potential topic is jailbreak attacks and other potential failure modes of LLMs.


About: Lily Weng is an Assistant Professor in the Halıcıoğlu Data Science Institute at UC San Diego with affiliation in the CSE department. Her research interest is in machine learning and deep learning, with primary focus on Trustworthy AI.
Mentoring Style: Students will form teams to conduct research projects. Student teams are expected to lead the research project. Students are expected to be proficient in PyTorch and deep learning libraries, and have experience in setting up deep vision models and open-sourced LLMs.
Suggested Prerequisites:
Summer Tasks: For students to be successful in this session, they need to be able to understand the following papers and be able to set up the code repositories successfully:
1. Concept Bottleneck Large Language Models: https://arxiv.org/pdf/2412.07992
2. Interpretable Generative Models through Post-hoc Concept Bottlenecks: https://arxiv.org/abs/2503.19377
3. Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities: https://arxiv.org/pdf/2410.18469
Previous Project

Causal Copilot
Biwei Huang • bih007@ucsd.edu
TA: TBA
A04 8 seats Friday afternoon

This domain focuses on causal discovery—the process of learning cause-and-effect relationships from data—and the development of a Causal Copilot, an AI assistant that helps users automatically learn and interpret causal relationships. Students will explore methods to learn causal graphs from observational data, simulate interventions, and answer “what-if” questions across domains like health, economics, and science. The project combines causal inference, machine learning, and interactive AI systems.


About: I am an assistant professor at HDSI. My research focus is causal discovery and causal AI.
Mentoring Style: May involve my PhD students
Suggested Prerequisites:
Summer Tasks: Reading:
1. Causal-Copilot: An Autonomous Causal Analysis Agent. https://arxiv.org/pdf/2504.13263
2. The PC algorithm in causal discovery
Previous Project

Language Models


Interplay between Machine Unlearning and Optimization
Jun-Kun Wang • jkw005@ucsd.edu
TA: In-person (need room booked for me; required if mentoring >4 students in-person)
A05 4 seats Fridays at 1PM.

“Machine Unlearning” concerns the scenario in which a model trained by an algorithm on a dataset is updated to respond to a request to "delete" certain sample points used in its training. The motivation arises from the recent success of Large Language Models (LLMs) and other foundation models that are potentially trained on large corpora of data, some of which might contain copyrighted data points. Developing methods to tackle this problem has become an urgent need, partially due to recent regulations such as the EU's *Right to be Forgotten*. In this project, we will first review related works, e.g., [1]-[4]. We will then consider developing our own methods. Students working on this project will be required to implement and conduct comprehensive experiments to test the proposed methods using Python/PyTorch.
References:
[1] TOFU: A Task of Fictitious Unlearning for LLMs. Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C. Lipton, J. Zico Kolter. Conference on Language Modeling (COLM) 2024.
[2] Are we making progress in unlearning? Findings from the first NeurIPS unlearning competition. Triantafillou et al. arXiv:2406.09073.
[3] Exact Unlearning of Finetuning Data via Model Merging at Scale. Kevin Kuo, Amrith Setlur, Kartik Srinivas, Aditi Raghunathan, Virginia Smith. MCDC @ ICLR 2025.
[4] Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning. Ruiqi Zhang, Licong Lin, Yu Bai, Song Mei. Conference on Language Modeling (COLM) 2024.


About: I am an assistant professor at HDSI and ECE. My research is centered around optimization and its connections with statistics and machine learning.
Mentoring Style: Students will be expected to use Python/PyTorch to implement their algorithms (and should be able to code).
Suggested Prerequisites:
Summer Tasks: Try to understand the materials on this website https://unlearning-challenge.github.io/. We will test our algorithms by following the experimental setup and the datasets detailed on this site.
Previous Project

Open LLM Training, Inference, and Infrastructure
Hao Zhang • haz094@ucsd.edu
TA: Zoom
A06 8 seats Mondays 3-4pm

The rapid advancement of large multimodal models has revolutionized AI systems, resulting in unprecedented levels of intelligence as seen in OpenAI’s GPT-4. However, despite its performance, the training and architecture details of GPT-4 remain unclear, hindering research and open-source innovation in this field. In this project, we'll explore three areas relevant to LLMs:
- On the system side: infrastructure for scalable training and high-throughput serving with advanced memory management and parallelization techniques.
- On the model side: building a multimodal model close to ChatGPT quality that can also interact with the real world by taking actions and using tools.
- On the data and benchmark side: developing highly curated datasets and a benchmark platform with novel data augmentation, data filtering, and ranking methods.


About: Hao Zhang is an Assistant Professor in the Halıcıoğlu Data Science Institute and the Department of Computer Science and Engineering at UC San Diego. Before joining UCSD, Hao was a postdoctoral researcher at UC Berkeley working with Ion Stoica (2021 - 2023). Hao completed his Ph.D. in Computer Science at Carnegie Mellon University with Eric Xing (2014 - 2020). During his PhD, Hao took leave and worked for the ML platform startup Petuum Inc (2016 - 2021). Hao's research interest is at the intersection of machine learning and systems. Hao's past work includes vLLM, Chatbot Arena, Vicuna, Alpa, Poseidon, and Petuum. Hao’s research has been recognized with the Jay Lepreau best paper award at OSDI’21 and an NVIDIA pioneer research award at NeurIPS’17. Hao also cofounded the company LMNet.ai (2023), which joined Snowflake in November 2023, and the nonprofit LMSYS Org (2023), which maintains many popular open models, evaluations, and systems.
Mentoring Style: hands off
Suggested Prerequisites:
Summer Tasks: Read papers in https://cseweb.ucsd.edu/~haozhang/publication
Previous Project

Evaluation Strategies for Next-Generation AI Systems
Rajeev Chhajer, Ryan Lingo
TA: TBA
A07 12 seats Mondays at 12pm-1pm PST Industry Partner: Honda Research Labs

As AI systems powered by Large Language Models (LLMs) become increasingly pivotal in high-impact scenarios, conventional metrics such as accuracy on public datasets are no longer sufficient. This domain revolves around exploring more flexible, context-aware methodologies for evaluating system performance across diverse dimensions, including depth of reasoning, ethical alignment, user experience, and practical effectiveness. Students will investigate and prototype varied assessment methods in multiple use cases and data environments, potentially incorporating automated audits to ensure continuous model alignment with organizational norms. By designing scalable evaluation pipelines and adapting them to specific application areas, participants will work toward developing AI systems that are more transparent, responsive, and dependable in evolving real-world contexts.


About: Rajeev Chhajer is the Chief Engineer at Honda Research Institute USA and leads the Software-defined Intelligence team at 99P Labs. He is a founding member of 99P Labs, a research initiative dedicated to developing sustainable technologies and innovative approaches to global challenges. His research focuses on smart city ecosystems, embedded systems, and connectivity to support sustainable and efficient mobility. Ryan Lingo is an Applied AI Research Engineer and Developer Advocate at 99P Labs. His work focuses on intelligent systems, with research interests in large language models, synthetic data, and applied machine learning. He has an academic background in philosophy and has held roles in data science and software engineering, with experience spanning academic research, industry, and early-stage startups.
Mentoring Style: We plan to take an engaged but student-led approach to mentoring. We'll work closely with the students throughout the project—meeting regularly, providing guidance, and being available for feedback and support. That said, we're looking for high-agency students who are excited to take ownership of their learning and direction. The best mental model for this capstone is "learning in public." Students will play an active role in shaping the plan and setting objectives. Rather than being given step-by-step instructions, they'll be encouraged to explore, make decisions, and figure out how to execute their ideas, with our mentorship to guide the way. We'll help them think critically, problem-solve, and communicate their process and outcomes clearly. While we won’t dictate tasks at a granular level, we’ll be present every week and ensure they have the support and structure they need to succeed.
Suggested Prerequisites:
Summer Tasks: Before the quarter starts, if you are interested in getting a head start, these three works offer advanced perspectives on evaluating large language models:
• [Your AI Product Needs Evals](https://hamel.dev/blog/posts/evals/) by Hamel Husain. This work dives into methods for assessing LLMs with a more holistic lens, addressing performance from qualitative angles that go beyond standard metrics. It challenges us to think about aspects such as contextual coherence and adaptability.
• [A Field Guide to Rapidly Improving AI Products](https://hamel.dev/blog/posts/field-guide/) by Hamel Husain. This guide provides practical strategies and frameworks for developing tailored evaluation pipelines. It emphasizes systematic, reproducible methods that capture nuanced performance characteristics and are well-suited for real-world scenarios.
• [Task-Specific LLM Evals that Do & Don't Work](https://eugeneyan.com/writing/evals/) by Eugene Yan. This piece critically examines traditional evaluation metrics and offers alternative, context-sensitive approaches to measuring LLM performance. It highlights the importance of incorporating reasoning depth, fairness, and other ethical considerations into our testing frameworks.
These resources together reinforce a shift toward evaluation methods that are context-aware, scalable, and aligned with the intricacies of modern language models. Focusing on these ideas will help you design and prototype robust evaluation pipelines that can address the evolving challenges in AI. If you'd like to discuss the project further, please connect with Ryan Lingo on LinkedIn.
Previous Project

Large Language (Multi-Modal) Model Reasoners and Agents
Zhiting Hu • zhh019@ucsd.edu
TA: TBA
A08 10 seats Tuesday 3-4PM

A central topic in Large Language or Multi-Modal Model research is enhancing their ability to perform complex reasoning on diverse problems. Extensive research has been done on generating multi-step reasoning chains with LLMs, such as Chain-of-Thought (CoT), Reasoning-via-Planning (RAP), and the OpenAI o-series. This capstone aims to explore the diverse reasoning approaches of LLMs (and/or large multi-modal models) and investigate improvements, applications, and scalable implementations of these approaches. For example: (1) Proposing new reasoning algorithms or improving the performance of existing ones; (2) Developing algorithmic and/or system innovations to scale up existing advanced reasoning algorithms; (3) Developing new agent frameworks for various applications like deep research, AI scientists, and real-world embodied and social agents.


About: Zhiting Hu is an Assistant Professor in the Halıcıoğlu Data Science Institute at UC San Diego. He received his Bachelor's degree in Computer Science from Peking University in 2014, and his Ph.D. in Machine Learning from Carnegie Mellon University in 2020. His research interests lie in the broad area of machine learning and artificial intelligence, with a focus on principles, methodologies, and systems of building AI agents that learn and reason with efficiency and generality. His current work centers on building general world models for next-generation machine reasoning and unified learning mechanisms for training machines with all types of experience. His research was recognized with outstanding paper awards at ACL 2016 and ACL 2024, and best demo nominations at ACL 2019 and NAACL 2024.
Mentoring Style: Students can either join the mentor's research group to work closely with PhD students/postdocs on relevant projects, or propose their own ideas and lead the projects. Students are expected to be independent, and the mentor will provide advice as needed (PhD students/postdocs can also provide more hands-on guidance).
Suggested Prerequisites:
Summer Tasks: [1] https://arxiv.org/abs/2404.05221 [2] https://arxiv.org/abs/2312.05230 [3] https://arxiv.org/abs/2305.14992
Previous Project

Community-Centered Discrimination Audits of LLMs - Bias Rapid Action Teams
Stuart Geiger • sgeiger@ucsd.edu
TA: TBA
A09 6 seats Wednesdays 10-11am

This capstone will work with community members to audit pretrained Large Language Models for discrimination and bias, using perturbation-based or controlled-experimental methods. These systematically vary a template prompt along a potential type of discrimination, then observe differences in outputs. For example, if you ask ChatGPT (or TritonGPT) to act as a college admissions reviewer, does an application's score change if it references the Men's vs. Women's basketball team? Or being on the lacrosse versus basketball team? Or being from La Jolla versus San Ysidro? These methods are relatively simple from a statistics perspective, but the hard part is knowing what kinds of discrimination are of most concern to the people who will be impacted by model outputs and creating real-world template prompts that test for those concerns. This capstone will be centered around **talking and listening to real people** about their concerns with LLMs in real-world contexts, then using our data science expertise in a more consulting-style mode. If a team chooses university admissions, they might work with students, high school counselors, professors, and/or admissions staff. All students must take and pass the 3-hour UCSD/CITI IRB Human Subject Protection Training online course (Social and Behavioral Basic) by week 3 of Fall.
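The template-perturbation idea can be sketched in a few lines. This is only an illustration: `TEMPLATE`, `build_prompts`, `audit`, and especially `fake_model_score` are hypothetical names, and the scorer is a hard-coded stand-in for a real LLM call, so the numbers it produces are placeholders and not findings about any model.

```python
from statistics import mean

# Hypothetical template with one slot that is varied across conditions.
TEMPLATE = (
    "You are a college admissions reviewer. Score this application from 1 to 10. "
    "The applicant played on the {team} basketball team."
)

def build_prompts(template, slot, variants):
    """Fill the same template once per variant of the perturbed attribute."""
    return {v: template.format(**{slot: v}) for v in variants}

def fake_model_score(prompt):
    """Stand-in for an LLM call (assumption: returns a numeric score).

    Replace with a real model query; these values are made up for the demo.
    """
    return 6.5 if "Women's" in prompt else 7.0

def audit(prompts, score_fn, n_trials=5):
    """Query each variant several times and report its mean score."""
    return {v: mean([score_fn(p) for _ in range(n_trials)]) for v, p in prompts.items()}

prompts = build_prompts(TEMPLATE, "team", ["Men's", "Women's"])
results = audit(prompts, fake_model_score)
gap = results["Men's"] - results["Women's"]
print(results, gap)
```

With a real model, outputs vary from call to call, so the repeated trials matter: the quantity of interest is whether the gap between variants is larger than the run-to-run noise.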


About: I’m a social scientist with a background in the humanities, especially history and philosophy of science and technology, but I have enough expertise in computer science and data science to make trouble. I believe that data science systems should be fair, transparent, and accountable to the public, but that most are currently not. A lot of my research is in community-centered content moderation NLP systems for user-generated content, especially Wikipedia, where I formerly worked on their ML models and systems.
Mentoring Style: I will be the point of contact and there every week, but may bring in collaborators and my grad student advisees. I intentionally do not run a "lab", but I do have a "constellation of collaboration." Students can choose their own particular context in which LLMs are deployed and which kinds of community members / impacted people they want to consult.
Suggested Prerequisites:
Summer Tasks:
- A recent example of a perturbation-based audit study: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0318500
- For more readings, see https://auditlab.stuartgeiger.com
- Take the UCSD/CITI IRB Human Subject Protection Training online course (Social and Behavioral Basic). You must complete it by week 3 of Fall, but it is good to do it earlier. Register at citiprogram.org; this video shows how to register: https://www.youtube.com/watch?v=hOAgfK93QXg
Previous Project

Training Baby Language Models from Scratch
Alex Warstadt • awarstadt@ucsd.edu
TA: TBD
A10 6 seats Tuesday 11am

Large Language Models have an impressive ability to learn and use human language, but humans are still the state of the art when it comes to learning language efficiently. We acquire language from 100 million words or fewer, whereas LLMs are now trained on tens of *trillions* of words. The BabyLM Challenge (https://babylm.github.io/) is a competition centered around training small "BabyLMs" under constraints inspired by human language learning. The goal of a BabyLM submission is to train a model that learns language as data-efficiently as a human or that simulates properties of human learning and linguistic performance.


About: Alex Warstadt is an Assistant Professor at UCSD with joint appointments in HDSI and the Department of Linguistics. He received his PhD in linguistics in 2022 from NYU under Samuel Bowman, and completed a postdoc in 2024 with Ryan Cotterell. Alex runs UCSD's Learning Meaning and Natural Language Lab (LeMN Lab) which is an interdisciplinary group that uses insights from linguistics to advance and interpret language models and uses advances in machine learning to answer scientific questions about the nature of language.
Mentoring Style: This topic lends itself to multiple parallel projects, but a single large project is acceptable as well depending on group size. PhD students in LeMN Lab will be encouraged to co-supervise projects.
Suggested Prerequisites:
Summer Tasks: Replicate prior BabyLM studies, including training and evaluating LMs from scratch, launching jobs on a UCSD supercomputer cluster such as Nautilus, and tracking training progress and sweeps over several days or weeks using Weights & Biases.
Previous Project

Explorations on in-context learning in LLMs
Prof. Arya Mazumdar • amazumdar@ucsd.edu
TA: TBD
A11 8 seats Tuesdays at 1pm

In-context learning is a distinctive capability of large language models (LLMs) that enables them to perform tasks by leveraging examples or instructions provided directly within the input prompt, without any parameter updates or explicit fine-tuning. This means the model can adapt its behavior based on the context given in a single interaction, such as demonstrating a task through a few examples (few-shot learning) or simply providing a clear instruction (zero-shot learning). In-context learning allows LLMs to generalize across tasks and domains, making them highly flexible and efficient tools for a wide range of applications, from language translation to code generation. This approach contrasts with traditional machine learning paradigms that require model retraining or fine-tuning to incorporate new information. In-context learning raises several fascinating theoretical questions, such as:
1) Mechanism of Learning Without Weight Updates: How do LLMs perform complex reasoning or task adaptation based purely on input text, without changing internal weights? What internal representations support this kind of "learning"?
2) Implicit vs. Explicit Learning: To what extent does in-context learning mirror classical learning paradigms like gradient descent? Are LLMs simulating optimization procedures internally when given examples?
3) Limits of Generalization: What are the boundaries of what in-context learning can achieve? For instance, how well can models generalize to truly novel tasks or concepts they’ve never seen, even as examples?


About: Arya Mazumdar is a Professor at the University of California, San Diego. His research interests include coding theory, information theory, statistical learning, and distributed optimization. He was a recipient of multiple awards, including the Distinguished Dissertation Award for Ph.D. Thesis in 2011, NSF CAREER Award in 2015, EURASIP Best Paper Award in 2020, and IEEE ISIT Jack K. Wolf Student Paper Award in 2010. He was a Distinguished Lecturer of the Information Theory Society for 2023-24, and currently serves as an Associate Editor for IEEE Transactions on Information Theory, an Area Editor for the Now Publishers Foundations and Trends in Communications and Information Theory series, and an Action Editor of Transactions on Machine Learning Research.
Mentoring Style: Every week students must report progress. Every individual should clearly articulate what they did that week.
Suggested Prerequisites:
Summer Tasks: Read the above topics
Previous Project

Quantifying the credibility of large language model outputs
Yian Ma • yianma@ucsd.edu
TA: TBA
A12 8 seats Thursday at 10am

Accurate uncertainty quantification in large language models (LLMs) is essential for providing credible confidence estimates over their outputs. We will explore how to quantify the uncertainty and credibility of LLM outputs. Specifically, we will explore prompt perturbation methods over open-source LLMs.


About: I am an assistant professor at the Halıcıoğlu Data Science Institute and an affiliated faculty member in the Computer Science and Engineering Department at the University of California San Diego. Prior to UCSD, I spent a year as a visiting faculty member at Google Research. Before that, I was a post-doctoral fellow at EECS, UC Berkeley. I completed my Ph.D. at the University of Washington and obtained my bachelor's degree at Shanghai Jiao Tong University.
Mentoring Style:
Suggested Prerequisites:
Summer Tasks: Please read the following papers and prepare to lead a discussion on them at the beginning of the fall quarter: https://arxiv.org/pdf/2403.02509 https://arxiv.org/pdf/2406.02543 https://openreview.net/pdf?id=p6jsTidUkPx
Previous Project

🧠 Theoretical Foundations


Communication Complexity
Shachar Lovett • slovett@ucsd.edu
TA: TBA
A13 4 seats Mondays 1-2pm

Communication complexity is a field of theoretical computer science that studies the amount of communication required between two or more parties to jointly compute a function whose input is distributed among them. In its most basic model, two players—Alice and Bob—receive inputs x and y respectively, and want to jointly compute some function f(x,y) using as little communication as possible. The focus is on minimizing the number of bits exchanged, under various models (deterministic, randomized, or quantum), while still correctly computing the function. This model has been studied extensively since its introduction in the 80s, with many notable applications, but still there are many fundamental problems that are wide open. These include the log-rank conjecture, which attempts to classify the structure of efficient deterministic protocols using linear algebra; understanding the combinatorial structure of functions with efficient randomized protocols; analogs of these to multi-party settings; and many others. Moreover, problems in communication complexity tend to have strong connections to fundamental problems in combinatorics and other areas of math, which sometimes gives a unique perspective to help understand these problems better.
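As a small illustration of how randomness changes the cost of a protocol, consider the equality function, where Alice and Bob want to know whether x = y. The sketch below simulates the textbook public-coin protocol in which Alice sends k parity bits instead of her whole input. The function names and parameters are hypothetical, and the seeded generator stands in for the shared random string.

```python
import random

def inner_product_mod2(a, b):
    """Parity of the bitwise AND of two equal-length bit lists."""
    return sum(ai & bi for ai, bi in zip(a, b)) % 2

def randomized_equality(x, y, k=20, seed=0):
    """Public-coin protocol for EQ(x, y); returns (answer, bits sent by Alice)."""
    shared = random.Random(seed)  # models the random string both players see
    n = len(x)
    challenges = [[shared.randint(0, 1) for _ in range(n)] for _ in range(k)]
    # Alice's whole message: one parity bit per shared random vector.
    message = [inner_product_mod2(x, r) for r in challenges]
    # Bob recomputes the same parities on his input and compares.
    answer = all(bit == inner_product_mod2(y, r) for bit, r in zip(message, challenges))
    return answer, len(message)

x = [1, 0, 1, 1] * 8   # 32-bit input held by Alice
y = list(x)
y[5] ^= 1              # Bob's input differs from x in a single position

print(randomized_equality(x, list(x)))  # equal inputs are always accepted
print(randomized_equality(x, y))        # unequal inputs are caught with high probability
```

If x = y the protocol never errs. If x and y differ, each random parity disagrees with probability 1/2, so the chance of wrongly answering "equal" is 2^-k, while Alice sends only k bits regardless of the input length n. It is a standard result that deterministic protocols for equality need about n bits.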

Read more

About: I work broadly in theoretical computer science, with a specific interest in computational complexity, randomness and pseudo-randomness, algebraic constructions, optimization, and combinatorics.
Mentoring Style: Initially the capstone project will run standalone, as students learn the area. Then, depending on their interests, students can potentially join an existing research project or work on their own research project.
Suggested Prerequisites:
Summer Tasks: The book "Communication Complexity and Applications" by Rao and Yehudayoff is a good starting point. See also the following surveys for more recent progress related to my work: Recent advances on the log rank conjecture: http://bulletin.eatcs.org/index.php/beatcs/article/download/260/245 Models of computation between decision trees and communication: https://escholarship.org/content/qt80d93481/qt80d93481.pdf
Previous Project

Understanding deep learning through feature learning
Tianhao Wang • tianhaowang@ucsd.edu
TA: TBA
A14 8 seats Friday

The success of deep learning is often attributed to the ability of neural networks to learn useful features from data. Yet, the process of feature learning remains mysterious. In this project, we aim to develop a fundamental understanding of how neural networks learn features, with an emphasis on the dynamical perspective of the training process. We will explore feature learning for various neural network architectures, and investigate surprising phenomena in deep learning such as implicit regularization and grokking. Participants will gain hands-on experience in training neural networks and analyzing the training process.

Read more

About: Tianhao Wang will be joining the Halıcıoğlu Data Science Institute at UC San Diego as a tenure-track assistant professor in July 2025. He received his PhD from the Department of Statistics and Data Science at Yale University in 2024. Prior to Yale, he received his BS from University of Science and Technology of China in 2018. His research interests are mainly in theoretical foundations of statistical learning, especially high-dimensional statistics and deep learning.
Mentoring Style: Participants should feel free to explore on their own if they find a specific topic interesting. Meanwhile, participants will have opportunities to work with my PhD students. I will provide hands-on guidance.
Suggested Prerequisites:
Summer Tasks: It would be helpful to get familiar with deep learning frameworks such as PyTorch or JAX.
Previous Project

Transformers for graph learning
Yusu Wang and Gal Mishne • yusuwang@ucsd.edu; gmishne@ucsd.edu
TA: TBD
A15 16 seats Wed morning 9am preferred

Graph data is ubiquitous in a broad range of applications. There are two major families of graph learning models: graph neural networks and transformer-based graph learning models. Graph neural networks, such as message-passing graph neural networks, can naturally handle graph-type data (nodes, edges, and the features on nodes and edges). Standard transformer architectures, however, are designed for sequences or sets, not for relational data like graphs, so the graph structure has to be encoded into the transformer in some way. There are various design choices for doing so. In this capstone project, we would like to explore the pros and cons of different design choices for encoding graph information across a collection of graph tasks.
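As one illustration of such a design choice (a sketch only, not necessarily an encoding this project will use), a graph can be summarized for a transformer by its matrix of pairwise shortest-path distances, which attention layers can then consume as a bias term; this is the idea behind the spatial encoding in Graphormer. The function name and the `unreachable` sentinel below are hypothetical.

```python
from collections import deque


def shortest_path_matrix(num_nodes, edges, unreachable=-1):
    """All-pairs shortest-path distances for an unweighted, undirected graph.

    Entry [s][t] is the hop distance from s to t, or `unreachable` when
    no path exists. A graph transformer could look up a learned bias per
    distance value and add it to the attention score between s and t.
    """
    adjacency = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    dist = [[unreachable] * num_nodes for _ in range(num_nodes)]
    for source in range(num_nodes):
        # Breadth-first search from each source node.
        dist[source][source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if dist[source][v] == unreachable:
                    dist[source][v] = dist[source][u] + 1
                    queue.append(v)
    return dist


if __name__ == "__main__":
    # A path 0-1-2-3 plus an isolated node 4.
    for row in shortest_path_matrix(5, [(0, 1), (1, 2), (2, 3)]):
        print(row)
```

Other choices, such as positional encodings derived from the graph Laplacian, trade off expressiveness and cost differently, which is the kind of comparison the project description refers to.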

Read more

About: Yusu Wang works on geometric deep learning, graph learning, neural algorithmic reasoning, and topological data analysis. Gal Mishne works on manifold learning, diffusion geometry, computational neuroscience, image processing and graph signal processing, and applied harmonic analysis.
Mentoring Style: In the first few weeks we will give some "lectures" on the background, together with reading / experimenting materials. Usually students form groups of around 3 students each to develop the capstone projects.
Suggested Prerequisites:
Summer Tasks: - https://arxiv.org/abs/2205.12454 - https://arxiv.org/abs/2302.04181
Previous Project

Theoretical Computer Science
Barna Saha • barnas@ucsd.edu
TA: TBD
A16 4 seats Early morning on Friday

Students will read theoretical papers related to fine-grained complexity and learn how to prove theorems. Fine-grained complexity seeks to explore the complexity of algorithms beyond the traditional coarse distinction between polynomial-time solvable and NP-hard problems. The central idea is to investigate the relationships between different computational problems and identify those that are "hard" in a more fine-grained sense.

Read more

About: Check here: https://en.wikipedia.org/wiki/Barna_Saha
Mentoring Style: Students will need to work independently.
Suggested Prerequisites:
Summer Tasks: Read relevant papers that appeared in recent STOC/FOCS conferences.
Previous Project

Simulation coding exercises for teaching probability theory
Peter Chi • pbchi@ucsd.edu
TA: TBA
A17 8 seats Mondays 3-4pm

Within the past two decades, simulation-based inference has established itself as a standard approach for teaching an introductory statistics or data science course (such as DSC 10). While it has been argued that simulation-based pedagogies should likewise be useful in a probability theory course, their implementation there is not currently well developed, nor is this notion even universally accepted to date. Students in this domain will explore this by developing coding exercises that are designed to teach concepts from a typical undergraduate probability theory course (such as MATH 180A, MATH 183, and MATH 181A). Specifically, the coding exercises that capstone students in this domain will create as part of their projects will task probability students with writing simulation code that illustrates a particular concept or theoretical result in a probability course. Possible deliverables at the end of projects in this domain could include any of the following, or other related/comparable items proposed by capstone students in this domain: (1) a set of coding exercises that each target a specific topic in a probability theory course; (2) instructor lesson notes for each coding exercise that detail how it could be implemented in a typical course and its pedagogical rationale (i.e. why we believe it should be effective); (3) solutions to each coding exercise, in both R and Python; (4) Shiny apps (written in either R or Python) for each coding exercise that visually and interactively demonstrate the solution code in action; (5) assessment questions, with answers, to test students on their resulting understanding of the probability concepts that each coding exercise aims to teach; (6) statistical analyses of data collected via an experiment designed by capstone students, directly investigating the efficacy of their coding exercises for teaching probability concepts.
Capstone students in this domain should plan to take the CITI Human Subjects Research training: Social and Behavioral Research, administered online (and at no cost) by the UCSD IRB office, early in Quarter 1 in preparation for collection of real data for item (6) above during Quarter 2.
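To give a flavor of what such a coding exercise might look like, here is a minimal sketch in Python. It is a hypothetical example, not an exercise from the domain: the task, function name, and parameters are assumptions. The probability student is asked to estimate by simulation the probability that two fair dice sum to 7, and then to compare the estimate with the exact value 6/36.

```python
import random


def estimate_sum_probability(target: int = 7, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate P(die1 + die2 == target) by rolling two fair dice `trials` times."""
    rng = random.Random(seed)  # seeded so that results are reproducible
    hits = 0
    for _ in range(trials):
        roll = rng.randint(1, 6) + rng.randint(1, 6)
        if roll == target:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    estimate = estimate_sum_probability()
    exact = 6 / 36  # six of the 36 equally likely outcomes sum to 7
    print(f"simulated: {estimate:.4f}   exact: {exact:.4f}")
```

Rerunning with different values of `trials` lets students watch the estimate settle toward the exact answer, which connects the exercise to the law of large numbers.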

Read more

About: I received my PhD from the Department of Biostatistics at the University of Washington. My primary research focus is statistical phylogenetics, with secondary interests in statistics education and in casino games of chance. Prior to joining the faculty at HDSI, I was an Associate Professor of Statistics at Villanova University -- the alma mater of one pope and three future NBA champions (I am writing this on May 17th 2025 as the NY Knicks have just won their Eastern Conference Semifinal series, and I'm calling it now).
Mentoring Style: I will mentor this domain solo (without assistance from graduate students). I aim to give students the background and confidence to take ownership of their projects, and will likely be fairly hands-on at first: in particular, I will lead discussions on concepts in the field of statistics education, and give instruction on how to write Shiny apps as needed. I will also cover principles of experimental design and relevant statistical analyses as needed.
Suggested Prerequisites:
Summer Tasks: (1) Review concepts from whichever probability course(s) you have taken; (2) Explore the Shiny apps here to get an idea of what is possible with Shiny apps for teaching statistics: https://statistics.calpoly.edu/shiny; (3) Read this paper (although its aims were different from what ours will be, it is one of the only scientific research papers to date that addresses the idea of using simulation in a probability theory course, so it will be good to be familiar with what they have done and the issues that they raise): https://www.tandfonline.com/doi/full/10.1080/10691898.2019.1600387
Previous Project

Applied Data Science


NLP Credit Score Development
Brian Duke, Kyle Nero •
TA: TBA
B01 12 seats Thursday 1-2pm Industry Partner: Prism Data

One of the most widely used and least understood parts of the financial services industry is the credit score. In this course, students will work with transactional bank data to build statistical models for the purpose of assessing creditworthiness in the financial services industry. The course will take students through the life of a model development project, from data exploration through model training and evaluation. Students will have the opportunity to work with both structured and unstructured data as they learn about the process and attributes that go into credit scores. Additionally, students will learn about the importance of model explainability and fairness.

Read more

About: Brian Duke has been a data scientist for 23 years in the Financial Services industry. He has worked at Capital One, FICO, SAS Institute, Accenture, Experian, Petal Card and currently is the Head of Data Science at Prism Data. A common theme in his work has been translating transactional data into useful scores and analytical insights for use in risk decisioning. Brian received his BA and MS from the University of California, San Diego and continues to reside in the San Diego area today. He holds 4 patents and has 12 pending in the United States. Kyle Nero graduated from UCSD HDSI in 2023, majoring in data science and minoring in business. During his senior year, he engaged with industry partner Prism Data through the HDSI Senior Capstone Project. He went on to intern with Prism Data following his graduation and joined the team full time as a Data Scientist in September 2023.
Mentoring Style: Students will complete projects in groups of 3-4. The goal of the course is to eventually build a credit score, but we will start by building a transaction categorization model using NLP techniques. Each week we will talk about techniques that can be applied to the next step in the project. We will begin by reviewing homework from the previous week and discussing ideas. Then we will introduce the next step and talk about how it can be approached. The goal is to introduce students to the model development process used in most financial services companies.
Suggested Prerequisites:
Summer Tasks: https://www.capitalone.com/learn-grow/money-management/when-did-credit-scores-start/, https://www.capitalone.com/learn-grow/money-management/fair-credit-reporting-act/, https://www.capitalone.com/learn-grow/money-management/equal-credit-opportunity-act/, https://www.nerdwallet.com/article/finance/credit-score-ranges-and-how-to-improve and https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc#:~:text=An%20ROC%20curve%20(receiver%20operating,False%20Positive%20Rate
Previous Project

Mining Privacy Designs in the News
Haojian Jin • h7jin@ucsd.edu
TA: TBA
B02 10 seats Tuesday afternoon

Mining Privacy Designs in the News aims to uncover the patterns, strategies, and tensions in how digital privacy is represented across news media. By analyzing a large corpus of news articles, this work surfaces recurring narratives, metaphors, and framings that shape public understanding of privacy. Through this lens, we can trace how societal norms, stakeholder agendas, and cultural anxieties are reflected in—and influenced by—media discourse, offering a foundation for more responsive design, policy, and public engagement.

Read more

About: Human developers create risky computer systems that eventually affect human users. Our lab, Data Smith Lab, takes a human-centered approach to (1) help developers create systems with enhanced privacy and security features and (2) help users safeguard their privacy and security.
Mentoring Style: We will have weekly meetings. My PhD students will also help the team. The ultimate goal of the research project is to produce a research paper.
Suggested Prerequisites:
Summer Tasks: https://www.haojianj.in/resource/pdf/sketchprivacy.pdf
Previous Project

Digital twin model for health with wearable data
Tauhidur Rahman • trahman@ucsd.edu
TA: TBA
B03 8 seats Monday AM. We will settle on a specific time slot closer to the start of the quarter, once everyone's schedules are known.

The biotech and biomedical industries rely heavily on randomized controlled trials (RCTs) to evaluate the efficacy and safety of medical interventions, yet these trials are prohibitively expensive and time-consuming—costing an average of $1.3 billion per drug and spanning 10–15 years for market approval in the pharmaceutical sector. Similarly, preclinical animal testing adds significant financial and ethical burdens, with failure rates exceeding 85% when translating results to humans. In this proposal, we seek to address these inefficiencies by advancing digital twin technologies as a transformative paradigm in biomedical research and development. Digital twins—AI-driven computational replicas of physiological systems—have the potential to accelerate innovation by providing scalable, cost-effective, and ethically viable alternatives to conventional RCTs. They can simulate disease progression, predict patient responses to interventions, and enable in silico testing of therapeutics, reducing reliance on human and animal trials. However, major gaps hinder their widespread adoption, including (1) lack of standardized, generalizable representations of physiological systems across individuals and timescales, (2) computational inefficiencies in personalizing models for diverse populations, and (3) limited capacity for knowledge transfer between different interventions and treatment domains. This project aims to bridge these gaps by establishing foundational methods for scalable, personalized, and computationally efficient digital twins. Through advanced physiological modeling, knowledge graph integration, and large language model (LLM)-augmented frameworks, we will develop novel approaches to enhance prediction, personalization, and decision-support in biomedical innovation. 
Our proposed methods will not only improve the efficiency and reliability of digital twins but also provide a generalizable infrastructure applicable across diverse medical domains, including diabetes management, kidney dialysis, and substance use disorder interventions. To address these challenges, we propose three core research thrusts: (i) Representation of Physiological Systems for Closed-Loop Modeling: We will develop featurization and representation learning techniques that generalize across diverse biomedical data types (from physiological signals to symptoms and comorbidities), ensuring adaptability across time scales (minutes to years) and population levels (individuals to cohorts). (ii) Computationally Efficient Digital Twins with Personalization Mechanisms and Knowledge Graphs: We will design scalable, personalized digital twin models that dynamically identify similar individuals within a population to enhance forecasting accuracy for physiological states based on specific interventions. Additionally, we will integrate knowledge graphs to learn relationships between different interventions, allowing the model to extrapolate to novel treatment strategies. (iii) Demonstration of a Generalizable Large Language Model (LLM)-Augmented Digital Twin Framework and Dashboard: We will implement and validate a scalable, interactive framework to support digital twin experimentation in multiple biomedical domains, including diabetes management, kidney dialysis, and substance use disorder interventions.

Read more

About: I direct the Mosaic lab.
Mentoring Style: My PhD student and I will attend the meeting every week. My mentoring style is to help you learn the right tools, identify intermediate goals, and eventually to ask challenging questions. Ultimately, success will depend largely on you and on how you tackle complex questions with a lot of uncertainty. I want you to take the lead (consider this an opportunity to write your own paper or build your own startup, for example), and I will be there to help you navigate the uncertainty inherent in research.
Suggested Prerequisites:
Summer Tasks: Enhance your familiarity with LLMs, deep learning, and generative AI tools and techniques.
Previous Project

Developing lessons to ease teachers into time series data science
Benjamin Smarr • bsmarr@ucsd.edu
TA: TBA
B04 8 seats Wednesdays 3pm

Data science skills can be transformative, especially when deployed in communities without historical access to digital resources. UCSD supports tools that make access to data science free, but without lessons aimed at helping teachers make confident use of these tools, most teachers won't see the potential of data science training, and won't want to offer curricula to their students. Our goal will be to develop lightweight lessons that take someone from zero knowledge to the ability to load small data sets into JupyterLite webpages and carry out basic analyses and visualization, so that they can feel good showing others in their community how these skills could help their students.

Read more

About: Prof. Smarr comes from a neuroscience and biological rhythms background. His lab focuses on using longitudinal data sources to develop novel analytics that reveal biologically relevant information from these data, framed by an understanding of the way biological data tend to change at different timescales. This is sometimes naturalistic, but more often related to biomedical algorithm development.
Mentoring Style: You will work mostly within your group, with some guidance from graduate students who have done related work. You will also meet weekly with Prof. Smarr to assess progress and shape next steps.
Suggested Prerequisites:
Summer Tasks: Read https://pmc.ncbi.nlm.nih.gov/articles/PMC5301430/ and https://pubmed.ncbi.nlm.nih.gov/35870975/, and develop some familiarity recreating the figures and analyses.
Previous Project

Deep Learning for Climate Model Emulation
Duncan Watson-Parris • dwatsonparris@ucsd.edu
TA: TBA
B05 6 seats Wednesday 2-3pm

The choices humanity makes in the next few decades will determine how much warmer the Earth will be by the end of the century, with implications for billions of lives and trillions of dollars in GDP. Many different emission pathways exist that are compatible with the Paris climate agreement, and many more are possible that miss that target. While some of the most complex climate models have simulated a small selection of these, it is impractical to use these computationally expensive models to fully explore the space of possibilities or assess all the associated risks. Our lab has recently developed state-of-the-art climate model emulators to enable fast, accurate and reliable predictions for any given scenario (https://github.com/duncanwp/ClimateBench). This project will extend this work by incorporating multiple climate models at different levels of fidelity to provide high-resolution predictions with robust uncertainties for improved decision making.

Read more

About: Duncan Watson-Parris is an atmospheric physicist working at the interface of climate research and machine learning. The Climate Analytics Lab (CAL) he leads focuses on understanding the interactions between aerosols and clouds, and their representation within global climate models. CAL is leading the development of a variety of machine learning tools and techniques to optimally combine a variety of observational datasets, including global satellite and aircraft measurements, to constrain and improve these models. Duncan is also keen to foster the application of machine learning to climate science questions more broadly and convenes the Machine Learning for Climate Science EGU session and co-convenes the “AI and Climate Science” discovery series that is part of the United Nations’ AI for Good program.
Mentoring Style: This work is central to CAL so I will meet personally with the students, and it can be integrated into the broader research program to the extent the students want to engage with it. The students will be welcome to join our Lab meetings (typically held at Scripps Institution of Oceanography).
Suggested Prerequisites:
Summer Tasks: - Skim the latest UN Intergovernmental Panel on Climate Change Synthesis Report to get a summary of the latest climate change science, especially the figures: https://www.ipcc.ch/report/ar6/syr/downloads/report/IPCC_AR6_SYR_SPM.pdf - Read the ClimateBench paper: https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2021MS002954 - Try out the xarray Python library for working with climate data: https://docs.xarray.dev/en/stable/
Previous Project

Analysis of Temporally Varying Point Cloud using Optimal Transport
Alex Cloninger, Rayan Saab • acloninger@ucsd.edu, rsaab@ucsd.edu
TA: TBA
B06 8 seats Mondays 2-3pm

Time-varying point clouds appear in a number of important applications. These range from Motion Capture (MOCAP) data, to molecular and particle dynamics, to crowd and swarm dynamics. In these applications, each "datum" of interest is a multi-dimensional time series of a large number of points over many time steps, and the associated questions are how to cluster and classify these data, or how to generate new examples. Unfortunately, analysis of these problems can be quite complex. Fundamentally, this boils down to three issues: 1) the lack of point-to-point registration from one time step to a later one, 2) the lack of time series tools for dealing with high-dimensional time series, especially when the data is not in a simple Euclidean vector space, and 3) the sheer size of the data storage / computation for just one time series example. The domain of this project will cover methods for comparing these point clouds by treating them as samples from time-varying distributions, and thinking about the analysis of these distributions. One tool we will use for these analyses is optimal transport, which can benefit the problem both theoretically and computationally. We will also consider deep learning and signal processing approaches to these types of data. Students who choose this project will delve into the mathematical and computational problems of these data types, utilizing tools from probability and statistics, signal processing, and linear algebra. They will also engage in hands-on coding and experimentation on algorithms for optimal transport and time series models, testing them on various data sets.
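As a small illustration of why optimal transport suits unregistered point clouds, consider the simplest case: two one-dimensional clouds with the same number of points. There the p-Wasserstein distance has a closed form obtained by sorting both clouds and matching points in order, so no point-to-point registration is needed. The sketch below uses only the Python standard library; the function name and the toy frames are assumptions made for this example, and real projects in higher dimensions would rely on a dedicated solver.

```python
def wasserstein_1d(xs, ys, p=1):
    """p-Wasserstein distance between two equal-size 1D point clouds.

    Each cloud is treated as a uniform empirical distribution. In one
    dimension the optimal matching pairs the sorted points in order.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("point clouds must be non-empty and of equal size")
    cost = sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys))) / len(xs)
    return cost ** (1 / p)


if __name__ == "__main__":
    # Toy "time-varying" cloud: the same three points drifting to the right,
    # stored in a different (unregistered) order in every frame.
    frames = [
        [0.0, 1.0, 2.0],
        [2.5, 0.5, 1.5],
        [2.0, 3.0, 1.0],
    ]
    for t in range(len(frames) - 1):
        print(t, wasserstein_1d(frames[t], frames[t + 1]))
```

Because the distance depends only on the distributions, shuffling the points within a frame leaves the result unchanged, which is exactly the property needed when registration between time steps is unavailable.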

Read more

About: Alex Cloninger is an Associate Professor in Mathematics and the Halıcıoğlu Data Science Institute. He works on computational models for learning similarities between data, and using these similarity measures to solve various scientific problems. Rayan Saab is a Professor in the Mathematics Department and at the Halıcıoğlu Data Science Institute. He works on developing computational methods and theory for solving problems related to collecting, processing, and analyzing data. He came to this work first through an undergrad degree in electrical engineering and finding himself always interested in both making things work and understanding why they do.
Mentoring Style: We both are relatively hands-on in the sense that we make ourselves available for problem-solving and discussions. That said, students have to be self-motivated, and motivated to do the readings and the work.
Suggested Prerequisites:
Summer Tasks: Here are some relevant readings / videos. Students need not go into the mathematical details as we can go through them together, but these papers give an idea of the different approaches and applications. The more familiar you are with the topic, the more we can do! **Optimal Transport** - https://www.youtube.com/watch?v=EauDdCzxphE - https://www.youtube.com/watch?v=mITml5ZpqM8 **Example Applications with Temporal Point Clouds** - https://mocap.cs.sfu.ca/ - https://www.ipb.uni-bonn.de/data/4d-plant-registration/ To be able to obtain really nice experimental results, you'll need to pick up PyTorch and also the POT: Python Optimal Transport toolbox.
Previous Project

ALERTCalifornia - Extreme Events Detection
Nathan Hui, Falko Kuester, Neal Driscoll • nthui@ucsd.edu, fkuester@ucsd.edu, ndriscoll@ucsd.edu
TA: TBD
B07 4 seats Tuesday 11am

The ALERTCalifornia research program continues UC San Diego’s more than 20-year legacy of collecting high-quality data through a network of natural hazard monitoring and detection cameras across the state. This growing network includes over 1,150 camera sensors that provide real-time imagery. They are located in wild spaces, on towers, and other high points across the entire state of California, and are used to watch for and monitor extreme events including wildfire and weather. The program’s historical archive of camera data contains over 38 billion timestamped and localized frames. These camera data have facilitated CALFIRE’s ability to rapidly respond to emerging wildfires as well as maintain situational awareness during ongoing wildfires and other natural disasters. We would like to investigate where machine learning techniques can assist with assessing camera network health, data integrity, and environmental signals. Potential projects include camera site uptime detection, cloud detection, marine layer height detection, Visual Flight Rules altitude estimation, horizon detection, or camera positioning calibration.

Read more

About: Nathan Hui is currently a research engineer at UC San Diego at the Qualcomm Institute. His area of focus is multi-domain robotics, 3D imaging, and distributed sensor networks. Previous projects include tracking transmittered wildlife using drones, measuring physical oceanographic data using intelligent surfboard fins, and measuring fish length using low-cost lasers, dive cameras, and machine learning. Prof. Kuester received an MS degree in Mechanical Engineering in 1994 and MS degree in Computer Science and Engineering in 1995 from the University of Michigan, Ann Arbor. In 2001 he received a Ph.D. from the University of California, Davis and currently is the Calit2 Professor for Visualization and Virtual Reality at the University of California, San Diego. Professor Kuester holds appointments as Professor in the Departments of Structural Engineering and Computer Science and Engineering at the Jacobs School of Engineering (JSoE) and serves as the director of the Cultural Heritage Engineering Initiative (CHEI), the Center of Interdisciplinary Science for Art, Architecture and Archaeology (CISA3), the Calit2 Center of Graphics, Visualization and Virtual Reality (GRAVITY) and the DroneLab. Neal Driscoll is a professor of geology and geophysics in the Geosciences Research Division at Scripps Institution of Oceanography at UC San Diego. Driscoll researches tectonic deformation and the evolution of landscapes and seascapes. His work primarily focuses on the sediment record to understand the processes that shaped the earth. As part of this research, Driscoll spends time at sea acquiring images of the seafloor and subsurface layers to understand the processes that shape Earth. Driscoll is also co-director of UC San Diego’s Center for Public Preparedness (CP2) and the ALERTCalifornia public safety program. ALERTCalifornia provides critical infrastructure for mitigating wildfire and natural disaster risk to life, property and ecosystems. 
The advanced network of more than 1000 cameras across California helps first responders monitor natural disasters such as wildfires, floods, and landslides. ALERTCalifornia is a vital resource that provides an array of technological tools, infrastructure and research that supports government agencies, utilities and the public in their response to ever-increasing natural disaster risk. ALERTCalifornia also gathers vital data to inform the greater understanding of natural disaster causes, active event behavior and post-event impacts to air quality, water quality, ecosystems, and human health.
Mentoring Style: We can facilitate mentorship in our facilities (Atkinson Hall). This will occur as part of our research group (regular meetings), with additional oversight under ALERTCalifornia (milestone updates).
Suggested Prerequisites:
Summer Tasks: Students should be able to utilize Nautilus NRP, be familiar with active learning techniques, semi-supervised or unsupervised learning, and utilizing web APIs. We expect most work to be done in Python. Please also be familiar with Docker, Poetry, Kubernetes, and Tornado.
Previous Project

Large Language Models in Healthcare
Aaron Boussina; Karandeep Singh • aboussina@health.ucsd.edu; karandeep@health.ucsd.edu
TA: TBA
B08 6 seats Monday, 10am

The release of GPT-4 in 2023 captured global attention when it demonstrated the ability to pass the United States Medical Licensing Examination (USMLE). Since then, Large Language Models (LLMs) have started to transform many aspects of healthcare, including how patients access medical information, how clinicians document care, and how payers and regulators manage and review clinical workflows. This surge in capability has fueled a wave of startups and initiatives—such as Hippocratic AI—that aim to deploy LLMs for high-stakes, patient-facing applications. However, the complexity, heterogeneity, and high-risk nature of the medical domain present unique challenges that general-purpose AI systems are not inherently equipped to handle. Critical questions remain about when LLMs can be considered safe for clinical use, how to systematically evaluate their performance on bespoke medical tasks, and how to integrate them into healthcare operations in a way that enhances rather than undermines clinical quality and patient safety. In this domain, students will build, adapt, and rigorously evaluate LLM-based systems for healthcare applications. Projects will focus on operationalization challenges such as ensuring factual accuracy, mitigating hallucinations, designing appropriate evaluation frameworks, and aligning model outputs with clinical standards and ethical considerations. Students will gain hands-on experience with both the technical aspects of model development and the practical challenges of deploying AI responsibly in healthcare settings.

Read more

About: Karandeep is a physician-scientist with expertise in the evaluation and implementation of statistical and machine learning models into the clinical and operational context. His research lab’s focus is on understanding translational issues of bringing AI into clinical practice, including transportability and generalizability issues, dataset shift, and clinical and operational outcomes. He serves as Chief Health AI Officer for the UC San Diego Health System and has a leadership role in the Jacobs Center for Health Innovation. He has >90 peer-reviewed publications focused primarily on machine learning, digital health, and natural language processing. The core focus of Aaron’s research is the development and implementation of predictive and generative systems in healthcare settings. His recent work on reducing sepsis-related mortality using deep learning was featured in npj Digital Medicine, Fortune, KPBS, and referenced in the Bipartisan House Task Force Report on AI. His recent work in the New England Journal of Medicine AI was the first publication to explore the use of generative AI to automate and scale costly documentation for hospital quality measurement. To enable code-to-clinic contributions, his research combines multiple disciplines including software engineering, deep learning, healthcare informatics, and implementation science.
Mentoring Style: Capstone students will be integrated into the Jacobs Center for Health Innovation (JCHI) research group and will lead an independent project. Students will be expected to manage their projects, develop their software, test their hypotheses, and submit a peer-reviewed paper by the end of the course (with mentorship).
Suggested Prerequisites:
Summer Tasks: Papers: - https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2781307 - https://ai.nejm.org/stoken/default+domain/VIG3W4P4GC3SIQASJF4D/full?redirectUri=doi/full/10.1056/AIcs2400420 Packages: - Transformers - vLLM
Previous Project

Hunting for Ghost Particles - Analyzing Time Series Data produced by Semiconductor Detectors
Aobo Li • aol002@ucsd.edu
TA: TBD
B09 4 seats Wednesday 3-4

Neutrinos are tiny particles that are almost like ghosts because they can pass through just about anything without being noticed. They're produced in huge numbers by the sun and other stars, but catching them is really tough because they hardly ever interact with other matter. Scientists use special, super-sensitive equipment such as semiconductor detectors to try to spot these sneaky particles and learn more about how the universe works. The Majorana Demonstrator experiment utilizes an array of these semiconductor detectors to capture neutrinos hidden in the time series data generated by these detectors. In this project, we will establish an analysis team dedicated to examining this time series data. The team will undertake multiple analytical tasks, including employing machine learning models for time series classification and regression, aiming to produce an energy spectrum akin to the one generated by the Majorana Demonstrator.

Read more

About: I am Aobo Li (you can call me obo, like the musical instrument). I am a new faculty member at HDSI and the Department of Physics. I earned my B.S. from UW Seattle and my PhD from Boston University, both in the field of physics. My research uses machine learning to squeeze out the maximum amount of information from ultra-sensitive radiation detectors, all in the quest to uncover extremely rare physics events in our universe.
Mentoring Style: To achieve our final analysis goal—the detector spectrum—students will need to construct and train 3–5 machine learning models using a fully labeled dataset. One of these models will address a regression task, while the others will tackle binary classification, using 0/1 labels. An Analysis Coordinator (AC) will oversee the entire model-building process and document everything in a unified analysis document. Within the project, we will form subgroups; each will select a machine learning task, propose a model to accomplish it, and provide weekly updates during meetings to track progress. The AC and I will engage with each student weekly to discuss their tasks and provide feedback on their updates. Additionally, students will receive detailed assistance from the AC on coding and technical aspects, whereas I will focus on providing in-depth guidance to the AC.
Suggested Prerequisites:
Summer Tasks: The Majorana Demonstrator data we will analyze is already available online. Data Download Website: https://zenodo.org/records/8257027 Data Release Notes: https://arxiv.org/pdf/2308.1085 All students who wish to get involved in this project should make sure to read the Data Release Notes carefully. Students should also try to download the data and make sure they can extract information from it (the data is stored in .hdf5 file format). Machine Learning Prerequisite - Students should make sure they can design, run and validate machine learning models for classification and regression tasks, ideally using PyTorch to build and train simple neural networks. During the data analysis process, students will have the freedom to pick their own models to use. Analysis Coordinator - One of the enrolled students will be elected as the analysis coordinator (AC) of this project. The AC will take a leadership role to coordinate model development among different subgroups and manage this project at a higher level. This will be an excellent leadership experience that can be highlighted on a student's CV. If you are interested in this position, please send an email to aol002@ucsd.edu. If no one volunteers, the advisor will appoint one student as the AC. Please be prepared to serve as the AC if you enroll in this project. Additional reading: - Nachman Undergraduate Thesis: https://drive.google.com/file/d/1oF8oiGke5SCVbKTbbPlNwxh9zYN_Nri4/view?usp=sharing (Please pay special attention to Section 3: Pulse Shape Parameter Pipeline) - Majorana Demonstrator Experiment: https://phys.org/news/2023-02-legacy-majorana.html
Previous Project

Blockchain
Rajesh K Gupta • rgupta@ucsd.edu
TA: TBD
B10 6 seats Wed 10-11 (Saturday 10AM is a standing alternative)

Blockchains provide a platform for developing new distributed programs and workflows that provide various services. They are particularly suited to services that involve asynchronous collaboration among diverse actors (humans or agents) to achieve overall system objectives. Among the key capabilities are verifiability, non-volatility/immutability of transactions, and enforcement of dependencies in a provably correct manner. In this capstone project, you will explore one such service, then design and implement it using smart contracts on a chosen platform (Solidity/Ethereum, Solana, Hyperledger, etc.). You may also consider building upon past projects such as those for GymCoin, RealEstate, etc.

Read more

About: Rajesh Gupta serves as a founding director of the Halıcıoğlu Data Science Institute and as a distinguished professor of Computer Science and Engineering at UC San Diego. His research is in embedded and cyber-physical systems with a focus on sensor data organization and its use in optimization and analytics. Prof. Gupta holds the Qualcomm Endowed Chair in Embedded Microsystems at UC San Diego and the INRIA International Chair at the French international research institute in Rennes, Bretagne Atlantique. He is a Fellow of the IEEE, the ACM and the American Association for the Advancement of Science (AAAS).
Mentoring Style: My mentoring style is to listen to your progress and plans on a weekly basis and lead you to think through alternatives.
Suggested Prerequisites:
Summer Tasks: Please study and review the basics of blockchain. Since the smart contract programming ecosystem is evolving, please research and practice with a potential development platform for your project. You may look at the past projects for suggestions.
Previous Project

HCI and Reinforcement Learning for Hearing Loss Compensation
Harinath Garudadri • hgarudadri@ucsd.edu, hari@nadiworks.com
TA: TBD
B11 6 seats 4 to 5 PM, weekdays and Weekends

HCI: Improve the user interfaces for the hearing impaired using GUI, Voice User Interfaces, and Gesture-based Interfaces. Reinforcement: Balancing Exploration and Exploitation constrained by available (but growing) feedback data from users.
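To make the exploration/exploitation trade-off concrete, here is a minimal sketch of an epsilon-greedy bandit. It is an illustration only, not the mentor's method: the function name `epsilon_greedy`, the idea of treating each "arm" as a candidate device setting, and the simulated yes/no user feedback are all assumptions made for this example.

```python
import random


def epsilon_greedy(reward_probs, steps, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy agent on Bernoulli arms.

    reward_probs: hypothetical probability that a user gives positive
    feedback for each arm (e.g. each candidate setting).
    Returns (counts, values): pulls per arm and estimated mean reward per arm.
    """
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    counts = [0] * n_arms
    values = [0.0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random arm to gather more feedback.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: use the arm with the best estimate so far.
            arm = max(range(n_arms), key=lambda a: values[a])
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean reward for this arm.
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values


if __name__ == "__main__":
    counts, values = epsilon_greedy([0.1, 0.9], steps=5000, epsilon=0.1, seed=1)
    print(counts, [round(v, 2) for v in values])
```

A larger `epsilon` gathers feedback on all settings faster but spends more time on settings already believed to be worse; with limited but growing feedback data, choosing that balance is the core design question.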

Read more

About: I run a lab at UCSD that conducts fundamental and translational research at the intersection of Technology, Healthcare, and Education -- THELab. I am the Founder/CEO of Nadi Inc, a California C-Corp that commercializes translational research in academia. We are launching "breakthrough" Hearing Aids using a SaaS model, in time for the 2025 Holiday season. My interest in volunteering to mentor DSC Capstone is to solve the hard problems in the area of Hearing Aids, and restore the function of our ability to communicate in a natural language.
Mentoring Style: I am assuming Capstones at HDSI may be different from the ones in ECE. Each week a selected student will give a presentation, and these presentations will count toward grading.
Suggested Prerequisites:
Summer Tasks: - https://a.co/d/9bXIFSq (Book: Thinking in Systems) - PyTorch - NLTK
Previous Project

Scrolling sound and talking hand
Victor Minces, Virginia de Sa • vminces@ucsd.edu, desa@ucsd.edu
TA: TBD
B12 6 seats Any day after 10am, preferably Tuesday to Thursday

My team has been developing an audio listening interface that can transform the way people consume podcasts, audiobooks, and videos. People are increasingly acquiring, creating, and sharing knowledge through audio and video, rather than reading. A problem with audio information, as opposed to reading, is that it is very difficult to 'scroll'. For example, if you space out while listening to a podcast, it can be cumbersome to find the last moment you were paying attention. It can also be difficult to skim through audio, dynamically changing the playback speed, forwards and back, to find the information you need. We are looking to solve this problem by creating a more organic listening interface. The task is creative, the possibilities endless, and there is a lot to learn. Another direction of my team is a 'talking hand', which transforms hand movements into speech. It can also sing! https://talkinghandminces.netlify.app/

Read more

About: Victor Minces studied fine arts and physics at the University of Buenos Aires, and obtained his Ph.D. in computational neuroscience at UCSD. He researched how the brain represents sensory stimuli, including light and rhythmic sound, and the cognitive basis of musical rhythm. He created a program making widely used web applications and hands-on activities for people to play, experiment, and learn about sound. Besides his science and technology endeavors, he is a sound artist and performer, leading audiences of hundreds of people to explore and create sound together. Recently, his team has been designing original sonic interfaces for people to create sound and music, listen to audiobooks, and communicate. HDSI's Associate Director de Sa is a leader in the fields of cognitive science, neuroscience, computer science, engineering, and data science. Her research utilizes multiple approaches to increase our understanding of how humans and machines learn to perceive the world around them. She earned her Ph.D. and master’s in Computer Science from the University of Rochester, and a bachelor’s degree in Mathematics and Engineering from Canada’s Queen’s University.
Mentoring Style: My students typically work independently, but I am very supportive. I meet with them once a week, but I am open (and happy) to follow up more often. I have experienced undergrads working with me, who can help guide new students. Dr. de Sa, and perhaps her students, will also be involved in mentoring, specifically as it relates to machine learning.
Suggested Prerequisites:
Summer Tasks: - Research algorithms to slow down and speed up sound, such as vocoders and granular synthesis. Reproduce one if possible. - Learn about frequency decomposition, spend some time playing with https://spectrogram.sciencemusic.org/ - If you are interested in the talking hand, watch and carefully study this video: https://youtu.be/aFnWSBKImQU?si=hwO5GX55HA3oV6KD - And this video: https://youtu.be/fxYCk1t5DKE?si=Lrn7c_JzvD9ueTB2 - And if you can, watch the whole series. - Reach out to me and I can assign you some tasks depending on your current knowledge and interest.
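As a starting point for the first task, here is a minimal sketch of granular (overlap-add) time stretching in plain Python. It is a deliberately naive version under stated assumptions: the function names, the grain size of 512 samples, and the 50% synthesis overlap are choices made for this example, and because grains are not phase-aligned the output will have audible artifacts on real audio. Vocoder-style methods exist largely to address that.

```python
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def ola_time_stretch(samples, rate, grain=512):
    """Change playback speed by `rate` (2.0 = twice as fast) without resampling.

    Grains are read from the input every `analysis_hop` samples and written
    to the output every `synth_hop` samples, then normalized by the summed
    window so that overlapping grains do not change the overall level.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    synth_hop = grain // 2
    analysis_hop = max(1, round(synth_hop * rate))
    window = hann(grain)
    n_grains = max(1, (len(samples) - grain) // analysis_hop + 1)
    out_len = (n_grains - 1) * synth_hop + grain
    out = [0.0] * out_len
    norm = [0.0] * out_len
    for g in range(n_grains):
        read_pos = g * analysis_hop
        write_pos = g * synth_hop
        for i in range(grain):
            if read_pos + i >= len(samples):
                break  # input shorter than one full grain
            out[write_pos + i] += samples[read_pos + i] * window[i]
            norm[write_pos + i] += window[i]
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]


if __name__ == "__main__":
    sr = 8000
    tone = [math.sin(2 * math.pi * 440 * i / sr) for i in range(sr)]
    print(len(tone), len(ola_time_stretch(tone, 2.0)), len(ola_time_stretch(tone, 0.5)))
```

Changing `rate` continuously while playing back is one simple way to experiment with "skimming" through audio.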
Previous Project

Wildfire and Property Intelligence Modeling with Cotality
Ilyes Meftah, Lawrence Vulis, Peter Nagy •
TA: TBD
B13 12 seats Preferably Thursday or Friday mornings Industry Partner: Cotality

This capstone will offer students the opportunity to select from three rich domains of applied data science in collaboration with Cotality: **Domain 1 — Wildfire Risk Intelligence: Advanced Modeling for Climate Resilience** Students will enhance wildfire catastrophe models by designing machine learning models to simulate fire intensity within wildfire perimeters. Focus is on improving hazard modeling to support insurance pricing, emergency planning, and resilience. **Domain 2 — Wildfire Risk Intelligence: Statistical Modeling of Wildfire Frequency** This project explores methods for assigning realistic event frequencies to wildfire footprints, matching historical damage patterns using statistical modeling and machine learning. It will provide exposure to risk quantification, spatial data processing, and policy-relevant analytics. **Domain 3 — Property Intelligence: Enhancing Geospatial Data Quality for Risk Assessment** Students will apply machine learning to improve land use classification across county boundaries and enhance data quality in nationwide parcel-level property databases. The goal is to refine features used across catastrophe models and climate analytics platforms. Across all domains, students will engage in hands-on work with real datasets, industry tools (Python/R, GIS), and catastrophe modeling techniques. The outputs are intended to directly improve Cotality’s modeling platforms and have measurable real-world impacts.

Read more

About: Ilyes Meftah has been a data scientist and catastrophe modeler with Cotality for 13 years. With a strong background in mathematics and quantitative finance (holding multiple master's degrees from Paris, France universities), Ilyes has developed risk assessment models for wildfires, hurricanes, and earthquakes throughout his career. Recently, he has been focusing his efforts on quantifying wildfire mitigation measures to help communities located in high-risk areas. He is passionate about solving complex problems and sharing knowledge with others. When not working on catastrophe models, he enjoys hiking around the world with his family. Lawrence Vulis is a senior hazard scientist at Cotality, where he works on building physical and AI-based models of natural hazard risk to properties. Prior projects include machine learning-based classification of river delta geometry, satellite-based tracking of arctic lake spatiotemporal dynamics, linking satellite-derived beach dynamics with off-shore wave climate in Southern California, and a geospatial database/platform for machine learning-based permafrost mapping. His educational background is in Civil and Environmental Engineering, with a B.E. from The City College of New York and a Ph.D. from UC Irvine, with an extended internship and brief stint at Los Alamos National Lab. Outside of work he enjoys spending time with his wife and dog on beaches and trails. Peter Nagy has been with Cotality for 15 years, applying his passion toward big spatial data problems that occur with parcels, buildings, and geographic data relating to natural hazard risks. Prior experience includes the virtual earth (streetside) team with Microsoft, as well as multiple projects with Vexcel including SRTM processing, feature extraction from radar imagery, visualizations of raster and vector imagery like polarimetric SAR compositions, and building the RAMS Antarctic DEM. He studied at the University of Colorado in Boulder where he still lives, enjoying outdoor activities like hiking and skiing.
Mentoring Style: Our mentoring approach combines structure with creativity in a collaborative environment. Weekly sessions will balance technical guidance with hands-on problem-solving. Students will have opportunities to interact with multiple catastrophe modeling experts at Cotality, gaining exposure to different perspectives and specialized knowledge. We believe learning works best when it's engaging and enjoyable, so we'll incorporate real-world applications and team-based challenges throughout the project. While we'll provide regular guidance and feedback, we value student initiative and will encourage independent exploration of solutions within our project framework. Our goal is to create an experience that's both intellectually stimulating and professionally valuable.
Suggested Prerequisites:
Summer Tasks: Suggested preparation: - Spatial Cluster Analysis (YouTube playlist): https://www.youtube.com/playlist?list=PLzREt6r1Nenk3L0ndufhYuwdrrfZqdsIA - Spatial Data Science General Topics (YouTube playlist): https://www.youtube.com/playlist?list=PLzREt6r1NenmFyTw8v2JZpEE4PZGNi5Ht - Python GIS Textbook (Part II and III): https://pythongis.org/part2/index.html - R users: Get comfortable with terra & sf libraries and spatial point pattern analysis More detailed resources will be shared as needed.
Previous Project

Comparing Maps from Human Brain Imaging
Armin Schwartzman • armins@ucsd.edu
TA: TBD
B14 10 seats Wednesday 3:30-4:30PM (with some flexibility around this)

The organization of the human brain is incredibly complex. Over the past decades in neuroscience, researchers have measured the human brain with increasing spatial and temporal resolution, using imaging, recording, tracing, and sequencing technologies. This has yielded detailed maps of the spatial structure (e.g., cortical thickness, receptor densities, gene expression) and functional properties (e.g., task activations, functional connectivity) of the brain. Comparing these brain maps is fundamental to understanding the complexities of brain organization, but presents unique challenges. Measures of spatial associations between brain maps are strongly influenced by spatial autocorrelation, leading to inflated false positives if not properly accounted for. This project explores methods in spatial statistics for testing the association between brain maps, enabling researchers to more accurately interpret map-to-map similarities and differences. We will learn how to: (1) work with neuroimaging data in Python, (2) understand and implement existing methods for comparing maps while accounting for spatial autocorrelation, and (3) apply these methods to open-access datasets.
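To give a feel for why autocorrelation matters, here is a toy one-dimensional sketch, not one of the methods used in the project. It tests the correlation between two "maps" against a null distribution built by circularly shifting one of them, which keeps each map's autocorrelation intact while breaking the alignment between them. Real brain maps live on a cortical surface or volume and need the more sophisticated spatial null models covered in the summer reading; the function names and the 1D setup here are assumptions made for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def circular_shift_pvalue(x, y):
    """Two-sided p-value for corr(x, y) against all circular shifts of y.

    Each shifted copy of y has the same values and the same neighbor
    structure as y (apart from the wrap-around point), so the null
    correlations reflect what autocorrelation alone can produce.
    """
    n = len(x)
    r_obs = pearson(x, y)
    null = [pearson(x, y[k:] + y[:k]) for k in range(1, n)]
    exceed = sum(abs(r) >= abs(r_obs) for r in null)
    return r_obs, (exceed + 1) / (len(null) + 1)


def random_walk(n, rng):
    """A smooth, strongly autocorrelated toy 'map'."""
    walk, level = [], 0.0
    for _ in range(n):
        level += rng.gauss(0, 1)
        walk.append(level)
    return walk


if __name__ == "__main__":
    rng = random.Random(0)
    a, b = random_walk(200, rng), random_walk(200, rng)
    print(circular_shift_pvalue(a, b))
```

Two independent random walks often show a sizeable raw correlation, which a naive test that assumes independent samples would call significant; comparing against shifted copies gives a more honest picture of how surprising that correlation is.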

Read more

About: With an undergraduate degree in electrical engineering, I discovered statistics for my PhD and have been doing data science since then (even when it wasn't called by that name). Much of my work involves signal and image analysis, but I'm interested in many theoretical and applied problems, even philosophical. Outside of academia, I like doing music, dancing, swimming, surfing, and more.
Mentoring Style: Mentoring will involve data science PhD student Gabriel Riegner. Students are expected to take ownership over the project. This implies taking initiative in learning about the topic (from the assigned material and other sources), implementing the methods in code, being resourceful when needing help, and asking questions. Students are expected to put in their best effort, plan their time over the quarter, make substantial progress each week, report on it each week, and come up with an action plan for the next steps (as opposed to waiting for the mentor to give instructions). In other words, be independent and ask for help when needed.
Suggested Prerequisites:
Summer Tasks: Reading material: - Comparing spatial null models for brain maps, Markello and Misic 2021 (paper: https://pubmed.ncbi.nlm.nih.gov/33857618/, code: https://markello-spatialnulls.netlify.app/#) - Neuromaps: structural and functional interpretation of brain maps, Markello et al. 2022 (paper: https://pubmed.ncbi.nlm.nih.gov/36203018/, code: https://netneurolab.github.io/neuromaps/)
Previous Project

Differentially private synthetic telemetry data with modern generative AI
Yu-Xiang Wang, Bijan Arbab • yuxiangw@ucsd.edu, barbab@ucsd.edu
TA: TBD
B15 8 seats Monday afternoon 4pm (tentative)

Differentially private synthetic data generation plays a crucial role in enabling data sharing and analysis while preserving individual privacy. By creating artificial datasets that statistically resemble real data without revealing any specific individual's information, this approach allows organizations to comply with privacy regulations such as GDPR and HIPAA. It fosters innovation and collaboration by making sensitive data accessible to researchers, developers, and analysts without exposing personal details. Furthermore, the use of differential privacy techniques ensures that the risk of re-identifying individuals remains mathematically bounded, providing strong, quantifiable privacy guarantees. This balance between utility and privacy is essential for advancing fields like healthcare, finance, and social science in a responsible and ethical manner. In this specific project, you will work with me and HDSI Industry Fellow Dr. Bijan Arbab to

Read more

About: I am a faculty member of the Halıcıoğlu Data Science Institute at UC San Diego, also affiliated with the CSE department. Broadly speaking, my students and I apply math and computing to (1) design faster, stronger and more efficient ML algorithms with provable guarantees and (2) solve societal challenges (e.g., data privacy, abuse prevention) that emerge in the AI era. Our recent focus areas include watermarking generative AI, making differential privacy practical, bridging offline and online RL, and developing a theory of adaptivity in deep learning.
Mentoring Style: Guiding the learning and research of students through weekly meetings with students who lead the projects.
Suggested Prerequisites:
Summer Tasks: Read about differential privacy (see the course I taught in Fall 2024 and the references therein https://cseweb.ucsd.edu/~yuxiangw/classes/DSC291-2024Fall/)
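As a warm-up for the reading, here is a minimal sketch of the Laplace mechanism applied to a counting query, the classic first example of a mathematically bounded privacy guarantee. It is not the synthetic-data method used in this project; the function names and the toy records are assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Release a count with epsilon-differential privacy.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1 / epsilon suffices.
    Smaller epsilon means stronger privacy and noisier answers.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    ages = [23, 35, 41, 29, 52, 67, 38, 45]  # hypothetical toy records
    rng = random.Random(7)
    for eps in (0.1, 1.0, 10.0):
        print(eps, private_count(ages, lambda a: a >= 40, eps, rng))
```

Running the demo with different values of `epsilon` shows the utility/privacy trade-off directly: the released count hugs the true value of 4 for large `epsilon` and becomes much noisier as `epsilon` shrinks.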
Previous Project

Agentic Applications and Knowledge Graphs in Life Sciences 🧪🧬
Abed El-Husseini, Balaji Veeramani •
TA: TBD
B16 8 seats Industry Partner: Deloitte

The development of compound AI systems, integrating the capabilities of Large Language Models (LLMs) with agentic frameworks, external tools, and knowledge bases, has recently gained considerable popularity. Agentic frameworks are pivotal, harnessing recent advancements in LLMs to enable synergistic interaction with external tools, thereby facilitating the creation of these sophisticated systems. Knowledge graphs play a crucial role, particularly in the life sciences, by structuring complex information related to biological knowledge, pharmaceuticals, adverse effects, mechanisms of action, and other pertinent entities. In this course we will delve into the application of integrated agentic frameworks and knowledge graphs for developing innovative solutions in life science domains. Example applications to be explored include hypothesis generation for scientific discovery, prediction of treatment outcomes, identification of adverse effects, optimization of clinical trials, and formulation of personalized health and lifestyle recommendations. This offering is designed for students interested in this interdisciplinary area, drawing upon principles from artificial intelligence, knowledge representation, and the life sciences. The objective is to empower students to design intelligent agents capable of interrogating biomedical data to yield actionable insights. Suggested pre-requisites - Python, knowledge), GenAI architecture patterns (RAG), agentic frameworks exposure (preferred), knowledge of life sciences/biology (will be useful but not mandatory)

Read more

About: Abed – Abed is an Applied AI Manager at Deloitte Consulting, specializing in Generative AI applications. Passionate about teaching, he has served as a business case mentor and capstone instructor for HDSI. A proud graduate of The Ohio State University, Abed now lives in Austin, Texas—the live music capital of the world—with his wife and son 🤠🎸. He's an avid runner and a dessert enthusiast, in that order. Balaji Veeramani is a specialist leader within Deloitte Consulting, helping organizations develop and adopt AI solutions responsibly. Balaji has been leading AI/ML teams developing deep learning, machine learning, data science and GenAI based solutions, for life sciences, healthcare, diagnostics, agriculture, investment management, and logistics organizations. Balaji received his Ph.D. in Biomedical Engineering from Johns Hopkins University, and a Masters in Electrical Engineering (signal processing) from Arizona State University.
Mentoring Style: casual, fun, engaging
Suggested Prerequisites:
Summer Tasks: - Knowledge Graphs: https://arxiv.org/pdf/2003.02320 - Biomedical Knowledge Graph: A Survey of Domains, Tasks, and Real-World Applications: https://arxiv.org/pdf/2501.11632 - AI Engineering (on building and evaluating agentic solutions): https://www.oreilly.com/library/view/ai-engineering/9781098166298/
Previous Project

Deep learning for understanding microbiome results
Prof. Rob Knight • rknight@ucsd.edu
TA: TBD
B18 8 seats TBA

This group is for students interested in exploring applications of deep learning to the microbiome space. The first few weeks of the course will be devoted to understanding what the microbiome is and past approaches to analyzing it, together with opportunities to apply deep learning techniques in various ways to analysis of the microbial DNA sequences, communities, and/or annotations. The specific results that will be re-analyzed and techniques used will be driven by student interests. For example, past capstone classes have focused on the microbiome as a source of health disparities in Hispanic and Latino populations, use of protein language models to identify antimicrobial peptides, and use of DNA language models to identify mutations associated with drug resistance or to analyze relationships between entire microbial communities and phenotype. Currently emerging opportunities include development of digital twin models of humans, spatial sequencing, and long-read metagenomics.

Read more

About: While researching the origins of life as a postdoc in Boulder, Colorado, I developed algorithms for comparing RNA that turned out to be enabling technologies for the whole microbiome field. I moved to UCSD at the end of 2014 to lead the Center for Microbiome Innovation, and my lab focuses on developing new technologies to read out and understand complex microbial communities in the environment and the human body. I gave a TED talk that has been viewed > 2 million times and wrote two popular books on the microbiome. Outside academia, I like to travel, cook, hike, and paddleboard.
Mentoring Style: The capstone group will run as a separate, focused entity, with relevant members of the lab (grad students, postdocs) providing additional perspective/co-mentorship. Attendance at lab meetings/code reviews for the whole lab is optional.
Suggested Prerequisites:
Summer Tasks: https://www.ted.com/talks/rob_knight_how_our_microbes_make_us_who_we_are?language=en Familiarity with Python, Pandas and PyTorch is essential. Depending on the project, TensorFlow, scikit-bio and STAN could be useful.
Previous Project

From Data to Dispatch - Optimizing SDG&E Field Services
Phi Nguyen, Chuck Hahm, Fatemeh Aarabi •
TA: TBA
B19 12 seats TBD Industry Partner: SDG&E

This capstone offers students a unique opportunity to partner with San Diego Gas & Electric (SDG&E) to improve how field services—like metering, inspections, and emergency repairs—are delivered across the region. Students will work directly with field technicians and analysts to explore how data science can optimize truck dispatches, reduce operational costs, and enhance safety, all while improving the customer experience for San Diegans. Using real-world utility data, students may apply machine learning, geospatial analysis, and optimization techniques to solve challenges such as predicting equipment failures, streamlining technician routes, or identifying service anomalies. This project is ideal for students eager to connect data with real-world impact, gain experience in applied analytics, and contribute to a more efficient and customer-focused energy future.

Read more

About: Dr. Nguyen graduated from UCSD with a Ph.D. in Materials Science and Engineering, where he developed nanomaterials for clean energy applications. He then worked for several years as a consultant in the energy sector, where his focus was on using data to support policies that promote clean energy and energy efficiency. Dr. Nguyen joined SDG&E to focus on developing models to mitigate wildfire risk in California and has since expanded his work to other areas that benefit San Diego communities. Charles ("Chuck") Hahm is a Data Scientist in SDG&E's Customer Field Service organization. His data science experience spans a range of industries, including cybersecurity, customer analytics, sensor analytics, medical diagnostics, and image processing. He has served as adjunct faculty and course developer in National University's graduate analytics program. In the government sector, he has served as Principal Investigator for SBIR (Small Business Innovative Research) grants for the U.S. Navy, U.S. Air Force, and National Institutes of Health. Chuck holds a master's degree in electrical engineering from the Illinois Institute of Technology and a bachelor's degree from the University of Illinois at Chicago. Dr. Fatemeh Aarabi holds a PhD in operations research from State University of New York at Buffalo. During her doctoral studies, she focused on applied operations research methods like routing and scheduling algorithms with applications in urban systems. After graduation she joined industry to develop optimization frameworks for emergency management systems, working on optimization algorithms and ML predictive methods to reduce the EMS response time. In 2022 Fatemeh joined SDG&E as a data scientist where she has focused on developing models to mitigate wildfire risk in California.
Mentoring Style: The student group will be a stand-alone unit at SDG&E led by Mentors. Mentors will first work with students to understand utility space, and then schedule time with other SDG&E staff who will provide tours, field visits, and other utility-specific training. Students will also be introduced to other data scientists and engineers at SDG&E who are available for support on an as-needed basis throughout the duration of the project. However, once an introduction is made, it will be up to the students to reach out to staff when support is needed. Students will be encouraged to present their ideas to staff members beyond the mentors.
Suggested Prerequisites:
Summer Tasks: None
Previous Project

Sepsis — Using Clinical Healthcare Data Science to Identify and Combat an Infectious Killer
Kyle Shannon • kshannon@ucsd.edu
TA: TBA
B20 8 seats Thursdays 1:30–2:30 PM

Sepsis is a life-threatening condition caused by the body’s extreme response to an infection. Early detection and intervention are crucial for improving patient outcomes. This project aims to develop a radiographic-enhanced clinical decision support system for early sepsis detection and risk assessment. The system leverages chest X-ray images and patient metadata to predict the probability of sepsis development within specific time frames after the X-ray is taken. The proposed pipeline consists of two main components: 1) A ResNet model that predicts lung anomalies based solely on chest X-ray images 2) A CatBoost model that combines the output of the first model with patient vitals and other relevant metadata to predict whether the patient is at risk of sepsis. The project involves extensive data engineering to preprocess and integrate the MIMIC-IV and MIMIC-CXR datasets.

Read more

About: Hi 👋 I’m Kyle Shannon. As a professional in the public health and data science fields, I am dedicated to improving healthcare accessibility and enhancing patient outcomes, particularly in rural America. My journey began at UCSD, where I studied in the CogSci department as an undergraduate and discovered my passion for Data Science when it was still an emerging field (2013). I later pursued my master's degree in Data Science at UCSD, and eventually co-founded a startup focused on healthcare access. My enthusiasm lies in data science projects that directly impact patient health outcomes, and I maintain a keen interest in cognitive neuroscience and tiny ML systems. Outside of work, you can find me on a tennis court or delighting in the ambiance of a cozy cafe while tackling projects.
Mentoring Style: My goal is to create a capstone experience that emulates a practical "job" setting, guiding students in effectively interacting with managers and data science leads, asking relevant questions, and fulfilling their responsibilities. I may assume various roles (e.g., DS lead, stakeholder, hospital admin, manager) to offer diverse perspectives. I incorporate a business angle to discuss the project's broader context, encouraging students to envision their work in scenarios such as product development or hospital consultancy. This approach helps them grasp real-world applications and develop a compelling narrative for their projects. I prioritize accessibility for my students throughout the week, for example, via Discord, and may involve domain experts for them to interview and learn from professionals in ICUs and EHR data. This context adds valuable insight and humanizes the data/system. I often hold informal meetings with my students over coffee to discuss progress and answer questions. Occasionally, I expect them to provide progress reports and mini-presentations, simulating a real-world organizational experience.
Suggested Prerequisites:
Summer Tasks: The following are recommended summer domain readings and tasks. Getting through some or all of these, especially if you are unfamiliar with the domain, is a good idea and will help you hit the ground running in the fall. I will be available during the summer to meet with you as a group once or twice if you wish. For clarity, the three areas I recommend focusing on during the summer are:
- Familiarizing yourself with EHR data
- Learning about the MIMIC dataset
- Beginning to understand a bit more about clinical critical care in an ICU
Previous Project

Deep learning analysis and augmentation of medical images
Albert Hsiao • hsiao@ucsd.edu
TA: TBA
B21 10 seats Mondays at 11am

This group will develop hands-on skill sets for developing deep learning algorithms for medical imaging. In the first quarter, students will reproduce our lab's prior results in chest radiography, including the estimation of blood serum markers of heart failure from chest radiographs, using standard CNN architectures. In the second quarter, students will investigate new approaches for interfacing specialized CNN image encoders with strategies that interrogate algorithm explainability.

Read more

About: Albert Hsiao, MD, PhD is a cardiothoracic radiologist trained in engineering at Caltech and bioengineering and bioinformatics in the UC San Diego Medical Scientist Training Program (MSTP). He completed his residency and fellowships in Interventional Radiology and Cardiovascular Imaging at Stanford before returning to UC San Diego as faculty in Radiology, where he leads advanced cardiovascular imaging and the Augmented Imaging and Data Analytics (AiDA) research laboratory. While a radiology resident at Stanford, he co-founded Arterys, a cloud-native software company to bring 4D Flow MRI and artificial intelligence technologies to market. He continues to partner with industry to develop and create new imaging technologies to improve diagnosis and management of disease.
Mentoring Style: Students will meet weekly with me and additionally interface with scientists, post-doctoral fellows and graduate students in the lab.
Suggested Prerequisites:
Summer Tasks: It may be worthwhile to review this paper, which will be a substantial component of the first quarter: https://ieeexplore.ieee.org/document/9768796
Previous Project

Classification of lab mouse behavior in various disease models
Benjamin Smarr, Manny Ruidiaz • bsmarr@ucsd.edu, manny.ruidiaz@murine.org
TA: TBA
B22 6 seats Thursdays 3pm Industry Partner: TLR Ventures

Jackson Labs (Jax) breeds many of the mice used in biomedical research. These mice often have specific genetic differences that model aspects of disease. Phenotyping these animals involves connecting changes in their genes to changes in their actual lives. This is still mostly done by hand, where someone looks and notes how excitable or stressed or attentive an animal is. But humans are error-prone and slow, so there's been a rise in efforts to video capture animals in their home cages, and then use AI to identify differences in behavior. While a number of tools to support this exist now, parametrization of video data is often suboptimal or not biologically grounded, and results are often hard to visualize. TLR Ventures is a startup developing AIs to phenotype Jax mice from videos. As a new industrial partner with HDSI, they are excited to see what clever solutions and tools students will come up with to improve phenotyping and/or visualization tools to assist biologists who want to make use of these data. Through this experience you will learn about video processing and featurization, biomedical research and biological rhythms, and signal processing, and you can help decide on additional areas of focus, such as visualization, application of transformers, or anything else you might want to explore. You will also gain experience interacting with industry partners, and if you find the work compelling, possibly also a job opportunity.

Read more

About: Prof. Smarr comes from a neuroscience and biological rhythms background. His lab focuses on using longitudinal data sources to develop novel analytics that reveal biologically relevant information from these data, framed by an understanding of the way biological data tend to change at different timescales. This is sometimes naturalistic, but more often related to biomedical algorithm development. Manny Ruidiaz is a Distinguished Technology Leader with demonstrated experience in agricultural, biotechnology, machine learning and applied imaging sectors. He’s delivered multiple successful, high-impact, end-to-end projects, and has enabled new business opportunities.
Mentoring Style: I like to support exploration. I will provide overviews of techniques and relevant biology, and I will help you identify goals each week. You will do that with me, so that we use this opportunity to challenge you, but also to provide an excuse for you to learn things you wanted to learn, but maybe needed data and/or guidance to actually dive in.
Suggested Prerequisites:
Summer Tasks: Learn a little about TLR Ventures (https://theorg.com/org/tlr-ventures) and about Jax mice (https://www.jax.org/). Please also look into automated behavior analysis. The following is a good paper giving an example of how good video analysis can uncover important classifications that were not obvious by eye or by statistical aggregation.
Previous Project

AI-Assisted Disease Ontology Standardization & Automated Tagging
Raju Pusapati PhD, Raghunandha Reddy Burri PhD, Murali Krishnam, Justin Eldridge •
TA: TBA
B23 8 seats 9AM–10AM, Day TBD Industry Partner: Solix

A comprehensive disease knowledge base integrates ICD-10, WHO classifications, MeSH terms, and rare disease registries to enable standardized disease definitions (including subtypes, phenotypic variations, and evolving classifications) and to reduce ambiguity in electronic health records (EHRs), genomic studies, and real-world evidence (RWE) analysis. However, challenges persist due to:
- Heterogeneous disease nomenclatures (e.g., "Type 2 Diabetes" vs. "T2DM" vs. "NIDDM")
- Inconsistent hierarchical mappings (e.g., autoimmune disorders, cancer subtypes)
- Sparse or conflicting annotations for rare diseases
- Manual curation bottlenecks in updating ontologies (e.g., ICD-11 transitions)

AI-assisted data curation addresses these gaps by automating:
- **Semantic Labeling & Tagging**: AI-assisted labeling and tagging of disease names, synonyms, and subtypes from unstructured text (EHRs, literature) helps organize data and automate repetitive Human-in-the-Loop feedback. Additionally, AI-driven data sampling for model training will overcome class imbalance and scarcity of labeled data.
- **Hierarchical Classification**: Machine learning models will map diseases to standard ontologies (ICD, SNOMED, MONDO) and infer subtype relationships.
- **Conflict Resolution**: Disambiguation of overlapping terms (e.g., "ALS" as amyotrophic lateral sclerosis vs. Advanced Life Support).

**Expected Project Deliverables**:
1. Ontology Mapping with detailed data dictionary and schema documentation
2. Rules and Analysis conducted for data cleanup for conflicts/outliers
3. Demo of the work including models and data stores used (SQL, NoSQL, Vector DB), architecture stack, and training methodology
4. Documentation including pseudo/source code and model performance metrics (precision, recall, F1-score) for semantic labeling and hierarchical classification
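To make the nomenclature problem and the evaluation metrics named in the deliverables concrete, here is a minimal sketch in Python. It is not the mentors' method: the synonym table, the canonical label `T2DM`, and the helper names `normalize` and `precision_recall_f1` are all hypothetical and chosen only for illustration. It maps raw disease mentions to one canonical label with a lookup table, then scores the result with one-vs-rest precision, recall, and F1.

```python
# Hypothetical synonym table (illustrative only): surface form -> canonical label.
SYNONYMS = {
    "type 2 diabetes": "T2DM",
    "t2dm": "T2DM",
    "niddm": "T2DM",
}

def normalize(mention: str) -> str:
    """Map a raw disease mention to a canonical label, or 'UNKNOWN' if unseen."""
    return SYNONYMS.get(mention.strip().lower(), "UNKNOWN")

def precision_recall_f1(gold, predicted, label):
    """One-vs-rest precision, recall, and F1 for a single label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy data: the last mention is a synonym the lookup table does not know.
mentions = ["Type 2 Diabetes", "T2DM", "NIDDM", "adult-onset diabetes"]
gold = ["T2DM", "T2DM", "T2DM", "T2DM"]
predicted = [normalize(m) for m in mentions]

p, r, f = precision_recall_f1(gold, predicted, "T2DM")
print(predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

A static lookup table misses any synonym it has not seen, which shows up here as lost recall. That gap is the motivation for the learned labeling and classification models the project describes.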

Read more

About:
Raju Pusapati: Drug Discovery Specialist with 15+ years of experience spanning basic, translational, and clinical cancer research.
Raghunandha Reddy Burri: Computational Drug Discovery professional with 15+ years of experience working with academia and the pharma industry.
Murali Krishnam: Drug Development AI Product & GTM Leader with 25+ years of experience in successful product launch, commercialization, and exit.
Mentoring Style: Provide guidance and direction, assist with planning activities, brainstorm approaches to problem solving, and evangelize and quantify the benefits/impact.
Suggested Prerequisites:
Summer Tasks:
- Medical Dictionary such as WHO
- Medical Coding such as ICD-10/ICD-11
- Basic understanding of Real World Data (RWD) such as EHR and Real World Evidence (RWE)
- Basic understanding of Semantic Models, Ontologies & Graph RAG (retrieval augmented generation using graph databases)
Previous Project

Gen AI for Good - Joint work with Ali Arsanjani of B26.
Samuel Lau • lau@ucsd.edu
TA: TBA
B24 8 seats Wed or Friday afternoon

See B26.

Read more

About: Sam Lau is an Assistant Teaching Professor in the Halıcıoğlu Data Science Institute at UC San Diego. His research creates novel interfaces for learning and teaching data science, including the popular Pandas Tutor tool (https://pandastutor.com/) which serves over 40,000 people per year. He is the author of Learning Data Science, published by O’Reilly Media in 2023.
Mentoring Style: Most teams will work on independent projects, although there are options for students who wish to contribute to a larger system or existing tool.
Suggested Prerequisites:
Summer Tasks:
Previous Project

Supply-Demand Group Headcount Forecasting
Akash Shah, Lisa Li, Bor-Chau Juang, Pratyush Panda, Victor Calderon •
TA: TBA
B25 4 seats Wed 2–3pm Industry Partner: Intuit

The challenge of workforce management in customer-facing operations centers on effectively aligning staffing levels with fluctuating customer demand across diverse interaction channels. This problem space involves not only forecasting incoming customer volumes but also determining the optimal allocation of personnel, often referred to as headcount forecasting, to meet predefined service objectives. A significant complexity arises from the need to consider various operational constraints, such as target response times (e.g., Average Speed to Answer), average interaction durations (e.g., Average Handle Time), and overarching service level agreements. Furthermore, the relationship between customer demand types and the staff groups capable of handling them is often intricate, featuring many-to-many mappings where a single staff group might service multiple demand categories, and a single demand category might be serviceable by several different staff groups, each with potentially varying skill sets and efficiencies. The core goal is to ensure adequate staffing to meet demand effectively while optimizing resource utilization and maintaining service quality.
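As a rough illustration of how demand volume, Average Handle Time, and a target Average Speed to Answer can translate into a headcount, the sketch below uses the classic Erlang C queueing formula. This is a swapped-in textbook technique, not the approach Intuit or the mentors use; the function names and the example numbers (100 contacts per hour, 360-second handle time, 20-second target) are assumptions made up for the example. It also treats a single demand type served by a single staff group, so it ignores the many-to-many mappings described above.

```python
import math

def erlang_c(agents: int, load: float) -> float:
    """Probability that an arriving contact has to wait (Erlang C)."""
    if agents <= load:
        return 1.0  # unstable queue: effectively every contact waits
    # Erlang B via its standard recursion, then convert to Erlang C.
    blocking = 1.0
    for k in range(1, agents + 1):
        blocking = load * blocking / (k + load * blocking)
    return blocking / (1.0 - (load / agents) * (1.0 - blocking))

def average_speed_to_answer(agents: int, load: float, aht_seconds: float) -> float:
    """Expected wait in seconds before a contact is answered."""
    if agents <= load:
        return math.inf
    return erlang_c(agents, load) * aht_seconds / (agents - load)

def required_headcount(volume_per_hour: float, aht_seconds: float,
                       target_asa_seconds: float):
    """Smallest agent count whose expected wait meets the target."""
    load = volume_per_hour * aht_seconds / 3600.0  # offered load in Erlangs
    agents = max(1, math.floor(load) + 1)          # need more agents than load
    while average_speed_to_answer(agents, load, aht_seconds) > target_asa_seconds:
        agents += 1
    return agents, load

# Hypothetical inputs for one hour of one demand type.
agents, load = required_headcount(volume_per_hour=100, aht_seconds=360,
                                  target_asa_seconds=20)
print(f"offered load = {load:.1f} Erlangs, agents needed = {agents}")
```

In practice the volume input would itself come from a forecast, and shrinkage, multi-skill routing, and service level agreements would each add constraints that this single-queue model does not capture.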

Read more

About: Lisa Li is an AI/Data Scientist at Intuit, excited to mentor UCSD students in tackling industry-level challenges. With seven years of experience spanning the insurance sector and now tech at Intuit, she specializes in applying Deep Learning, Time Series models, and LLMs to create impactful solutions. Lisa holds a Master's degree in Data Science from GSU and a Bachelor's in Math from UCLA, and is eager to guide students through developing and deploying data-driven models that address real-world business problems.

Bor-Chau Juang’s professional role at Intuit involves AI research and development, specifically focusing on AI/ML solutions to enhance customer success within the Virtual Expert Platform (VEP). This encompasses the development of Large Language Model (LLM) applications for the extraction of expert knowledge, the refinement of issue resolution processes through agentic AI workflows, and the construction of recommendation systems to optimize customer routing and matchmaking.

Pratyush Panda is a Machine Learning Engineer at Intuit with over three years of experience in his current role. He brings a wealth of experience from previous machine learning engineering positions at Samsung Electronics America and Informatica. His expertise includes machine learning, Python, and Artificial Intelligence (AI), with a background in developing and deploying AI/ML solutions. Pratyush holds a Master of Science in Computer Science from California State University - East Bay.

Victor Calderon is a Machine Learning Engineer at Intuit, specializing in Generative AI and LLMs. An astrophysicist-turned-data-scientist, he focuses on applying Generative AI solutions to customer-related problems. Prior to Intuit, Victor developed and deployed computer vision models and MLOps pipelines at 5x5 Technologies Inc. He holds a Ph.D. in Physics with a focus on computational astrophysics from Vanderbilt University.
Mentoring Style: We plan to take an engaged but student-led approach to mentoring. We’ll work closely with the students throughout the project – meeting regularly, providing guidance, and being available for feedback and support. We’re also looking for students that can take ownership of their learning and direction, and can execute on the feedback provided. We’ll help them think critically, problem-solve, and communicate their process and outcomes clearly. Outside of the set office hours, we will do our best to respond to any inquiries within 48 hours, ideally sooner.
Suggested Prerequisites:
Summer Tasks: We will offer context and background as summer reading. Good to review:
- Python
- Forecasting analysis
- Pandas
- PyTorch
Previous Project

GenAI for Good
Ali Arsanjani • arsanjani@google.com
TA: TBA
B26 8 seats TBA

Generative AI for Good refers to the application of generative artificial intelligence (AI) techniques to address societal challenges and promote positive outcomes. In the context of misinformation and disinformation detection and mitigation, it involves leveraging generative AI models to combat the spread of false or misleading information and reduce socio-political polarization. Generative AI models, such as language models and deep learning algorithms, have shown remarkable capabilities in generating text and content that closely resembles human-produced content. These models can be trained to understand and analyze large amounts of data, including news articles, social media posts, and online discussions, to detect patterns and identify potential misinformation or disinformation. By employing generative AI techniques, it becomes possible to develop sophisticated algorithms and systems that can automatically identify false or misleading information, distinguish it from accurate information, and mitigate its impact on public opinion and discourse. These systems can analyze the content, context, and sources of information, looking for inconsistencies, logical fallacies, and biases that are indicative of misinformation. Generative AI can also play a crucial role in reducing socio-political polarization by promoting more balanced and factual narratives. By identifying and flagging content that contributes to polarization, algorithms can provide users with alternative viewpoints, fact-checking information, or context that helps to counterbalance the biases inherent in some narratives. This can encourage critical thinking, promote a more informed public, and foster constructive dialogue across diverse perspectives. However, it is important to note that generative AI techniques are not without challenges. 
Ensuring the accuracy and fairness of these models, avoiding biases, and balancing freedom of expression with the need to combat misinformation are critical considerations. Ethical guidelines and rigorous validation processes should be put in place to address these concerns and ensure the responsible and effective deployment of generative AI for good in the context of misinformation and disinformation detection and mitigation. alternusvera.com

Read more

About: www.linkedin.com/in/ali-arsanjani
Mentoring Style: As a formal course, and in teams
Suggested Prerequisites: NLP
Summer Tasks: Learn NLP, especially with large language models from Google using Google AI Studio.
Previous Project


AI/ML Systems

Data Valuation & Curation for Trustworthy AI Babak Salimi • TA: TBA A01 8 seats Friday 3:00 – 4:00 PM PT

Hardware Acceleration of ML Algorithms Rajesh K Gupta • rgupta@ucsd.edu TA: TBD A02 6 seats Wed 11-12 (Sat 11-12 is an option as well)

Trustworthy Machine Learning Lily Weng • lweng@ucsd.edu TA: TBD A03 8 seats Monday at 4pm

Causal Copilot Biwei Huang • bih007@ucsd.edu TA: TBA A04 8 seats Friday afternoon

Language Models

Interplay between Machine Unlearning and Optimization Jun-Kun Wang • jkw005@ucsd.edu TA: In-person (need room booked for me; required if mentoring >4 students in-person) A05 4 seats Fridays at 1PM.

Open LLM Training, Inference, and Infrastructure Hao Zhang • haz094@ucsd.edu TA: Zoom A06 8 seats Mondays 3-4pm

Evaluation Strategies for Next-Generation AI Systems Rajeev Chhajer, Ryan Lingo • TA: TBA A07 12 seats Mondays at 12pm-1pm PST Industry Partner: Honda Research Labs

Large Language (Multi-Modal) Model Reasoners and Agents Zhiting Hu • zhh019@ucsd.edu TA: TBA A08 10 seats Tuesday 3-4PM

Community-Centered Discrimination Audits of LLMs - Bias Rapid Action Teams Stuart Geiger • sgeiger@ucsd.edu TA: TBA A09 6 seats Wednesdays 10-11am

Training Baby Language Models from Scratch Alex Warstadt • awarstadt@ucsd.edu TA: TBD A10 6 seats Tuesday 11am

Explorations on in-context learning in LLMs Prof. Arya Mazumdar • amazumdar@ucsd.edu TA: TBD A11 8 seats Tuesdays at 1pm

Quantifying the credibility of large language model outputs Yian Ma • yianma@ucsd.edu TA: TBA A12 8 seats Thursday at 10am

🧠 Theoretical Foundations

Communication Complexity Shachar Lovett • slovett@ucsd.edu TA: TBA A13 4 seats Mondays 1-2pm

Understanding deep learning through feature learning Tianhao Wang • tianhaowang@ucsd.edu TA: TBA A14 8 seats Friday

Transformers for graph learning Yusu Wang and Gal Mishne • yusuwang@ucsd.edu; gmishne@ucsd.edu TA: TBD A15 16 seats Wed morning 9am preferred

Theoretical Computer Science Barna Saha • barnas@ucsd.edu TA: TBD A16 4 seats Early morning on Friday

Simulation coding exercises for teaching probability theory Peter Chi • pbchi@ucsd.edu TA: TBA A17 8 seats Mondays 3-4pm

Applied Data Science

NLP Credit Score Development Brian Duke, Kyle Nero • TA: TBA B01 12 seats Thursday 1-2p Industry Partner: Prism Data

Mining Privacy Designs in the News Haojian Jin • h7jin@ucsd.edu TA: TBA B02 10 seats Tuesday afternoon

Digital twin model for health with wearable data Tauhidur Rahman • trahman@ucsd.edu TA: TBA B03 8 seats Monday AM. We will figure a specific timeslot out when the quarter gets near and we know all of our schedules better.

Developing lessons to ease teachers into time series data science Benjamin Smarr • bsmarr@ucsd.edu TA: TBA B04 8 seats Wednesdays 3pm

Deep Learning for Climate Model Emulation Duncan Watson-Parris • dwatsonparris@ucsd.edu TA: TBA B05 6 seats Wednesday 2-3pm

Analysis of Temporally Varying Point Cloud using Optimal Transport Alex Cloninger, Rayan Saab • acloninger@ucsd.edu, rsaab@ucsd.edu TA: TBA B06 8 seats Mondays 2-3pm

ALERTCalifornia - Extreme Events Detection Nathan Hui, Falko Kuester, Neal Driscoll • nthui@ucsd.edu, fkuester@ucsd.edu, ndriscoll@ucsd.edu TA: TBD B07 4 seats Tuesday 11am

Large Language Models in Healthcare Aaron Boussina; Karandeep Singh • aboussina@health.ucsd.edu; karandeep@health.ucsd.edu TA: TBA B08 6 seats Monday, 10am

Hunting for Ghost Particles - Analyzing Time Series Data produced by Semiconductor Detectors Aobo Li • aol002@ucsd.edu TA: TBD B09 4 seats Wednesday 3-4

Blockchain Rajesh K Gupta • rgupta@ucsd.edu TA: TBD B10 6 seats Wed 10-11 (Saturday 10AM is a standing alternative)

HCI and Reinforcement Learning for Hearing Loss Compensation Harinath Garudadri • hgarudadri@ucsd.edu, hari@nadiworks.com TA: TBD B11 6 seats 4 to 5 PM, weekdays and Weekends

Scrolling sound and talking hand Victor Minces, Virginia de Sa • vminces@ucsd.edu, desa@ucsd.edu TA: TBD B12 6 seats Any day after 10am, preferably Tuesday to Thursday

Wildfire and Property Intelligence Modeling with Cotality Ilyes Meftah, Lawrence Vulis, Peter Nagy • TA: TBD B13 12 seats Preferably Thursday or Friday mornings Industry Partner: Cotality

Comparing Maps from Human Brain Imaging Armin Schwartzman • armins@ucsd.edu TA: TBD B14 10 seats Wednesday 3:30-4:30PM (with some flexibility around this)

Differentially private synthetic telemetry data with modern generative AI Yu-Xiang Wang, Bijan Arbab • yuxiangw@ucsd.edu, barbab@ucsd.edu TA: TBD B15 8 seats Monday afternoon 4pm (tentative)

Agentic Applications and Knowledge Graphs in Life Sciences 🧪🧬 Abed El-Husseini, Balaji Veeramani • TA: TBD B16 8 seats Industry Partner: Deloitte

Deep learning for understanding microbiome results Prof. Rob Knight • rknight@ucsd.edu TA: TBD B18 8 seats TBA

From Data to Dispatch - Optimizing SDG&E Field Services Phi Nguyen, Chuck Hahm, Fatemeh Aarabi • TA: TBA B19 12 seats TBD Industry Partner: SDG&E

Sepsis — Using Clinical Healthcare Data Science to Identify and Combat an Infectious Killer Kyle Shannon • kshannon@ucsd.edu TA: TBA B20 8 seats Thursdays 1:30–2:30 PM

Deep learning analysis and augmentation of medical images Albert Hsiao • hsiao@ucsd.edu TA: TBA B21 10 seats Mondays at 11am

Classification of lab mouse behavior in various disease models Benjamin Smarr, Manny Ruidiaz • bsmarr@ucsd.edu, manny.ruidiaz@murine.org TA: TBA B22 6 seats Thursdays 3pm Industry Partner: TLR Ventures

AI-Assisted Disease Ontology Standardization & Automated Tagging Raju Pusapati PhD, Raghunandha Reddy Burri PhD, Murali Krishnam, Justin Eldridge • TA: TBA B23 8 seats 9AM–10AM, Day TBD Industry Partner: Solix

Gen AI for Good - Joint work with Ali Arsanjani of B26. Samuel Lau • lau@ucsd.edu TA: TBA B24 8 seats Wed or Friday afternoon

Supply-Demand Group Headcount Forecasting Akash Shah, Lisa Li, Bor-Chau Juang, Pratyush Panda, Victor Calderon • TA: TBA B25 4 seats Wed 2–3pm Industry Partner: Intuit

GenAI for Good Ali Arsanjani • arsanjani@google.com TA: TBA B26 8 seats TBA