AI-assisted coding: Experiments with GPT-4

Russ Poldrack

Professor, Department of Psychology

Associate Director, Stanford Data Science

Stanford University

How many of you have used AI to help with coding?

Today’s punch lines

  • Large language models (LLMs) are powerful coding assistants, but have important shortcomings
    • If you are not using them to help write code, you are probably throwing away time
  • LLMs raise fundamental questions about how we teach coding and how we assess learning of coding skills

https://arxiv.org/abs/2304.13187

Large language models

https://jalammar.github.io/how-gpt3-works-visualizations-animations/

Vaswani et al., 2017

OpenAI Codex

  • Released August 2021
  • Based on GPT-3
    • Trained on a large body of publicly available source code including 150+ GB of Python code from 54M public GitHub repositories
  • “Our results show that Codex performs better than most students on code writing questions in typical first year programming exams” (Finnie-Ansley et al., 2022)
  • Deprecated in March 2023 in favor of GPT-3.5

Github Copilot

GPT-4 codes at human level (Bubeck et al. 2023)

  • AI systems were given problems from the LeetCode platform for coding practice/testing
    • only problems posted after the GPT-4 training period ended (Sept 2021)
  • GPT-4 performs on par with human LeetCode users

Incremental improvements on GPT-4

  • Since early 2023 improvements have been incremental
    • Most have been based on GPT-4 with improved prompting techniques or additional training data
  • GPT-4 still dominates but open-source models are catching up

HumanEval: 164 coding problems with 8 tests each https://paperswithcode.com/sota/code-generation-on-humaneval

Why coding is an optimal use case for LLMs

  • LLMs can sometimes “hallucinate”

  • LLMs are most useful in cases where work is difficult to generate but relatively easy to verify (a minimal sketch follows below)
    • It’s trivially easy to test whether code is syntactically correct
    • It’s usually straightforward to generate automated tests to determine whether code is functioning correctly
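
A minimal sketch of what “easy to verify” means in practice (an illustrative example; the function and test are hypothetical, not from the experiments described later):

import ast

generated_code = "def add(a, b):\n    return a + b\n"

# Syntax check: ast.parse raises SyntaxError if the code is not valid Python
ast.parse(generated_code)

# Behavioral check: execute the code and run a simple automated test
namespace = {}
exec(generated_code, namespace)
assert namespace["add"](2, 3) == 5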

Experiments with GPT-4

  • Experiments performed in March 2023
    • Before multimodal support
    • Before supposed degradation of GPT-4 performance in mid-2023 (Chen et al., 2023)

Experiment 1: Interactive coding with GPT-4

  • Presented GPT-4 (via ChatGPT Plus) with 32 coding problems
    • Primarily focused on data science/statistics/ML and text mining
  • Caveats:
    • User was fairly novice with prompt engineering
    • Assessment was completely subjective and qualitative

GPT-4 coding performance

  • Median 69 lines of code (range 31-161)
  • “Success” defined as running with no errors and giving a result that was reasonably close to the intended result (by my subjective judgment)
  • “NS” meant that a relatively novice user (i.e. me) was not able to easily prompt a successful solution (i.e. within a few minutes of effort)

Example of GPT-4 one-shot success

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline

# 1. Generate synthetic data
np.random.seed(42)
mean = [0, 0]
cov = [[1, 0.5], [0.5, 1]]
train_data = np.random.multivariate_normal(mean, cov, 32)
test_data = np.random.multivariate_normal(mean, cov, 32)

X_train = train_data[:, 0].reshape(-1, 1)
y_train = train_data[:, 1]

X_test = test_data[:, 0].reshape(-1, 1)
y_test = test_data[:, 1]

# 2. Fit three models
linear_model = LinearRegression().fit(X_train, y_train)
second_order_model = make_pipeline(PolynomialFeatures(2), LinearRegression()).fit(X_train, y_train)
ninth_order_model = make_pipeline(PolynomialFeatures(9), LinearRegression()).fit(X_train, y_train)

# 3. Compute errors
models = [linear_model, second_order_model, ninth_order_model]
train_errors = [mean_squared_error(y_train, model.predict(X_train)) for model in models]
test_errors = [mean_squared_error(y_test, model.predict(X_test)) for model in models]

print("Training errors:", train_errors)
print("Test errors:", test_errors)

# 4. Plot fitted lines
plt.scatter(X_train, y_train, color='blue', label='Training data')

X_line = np.linspace(X_train.min(), X_train.max(), 100).reshape(-1, 1)
colors = ['green', 'red', 'purple']
labels = ['Linear', '2nd-order', '9th-order']

for i, model in enumerate(models):
    plt.plot(X_line, model.predict(X_line), color=colors[i], label=labels[i])

plt.legend()
plt.xlabel('Variable 1')
plt.ylabel('Variable 2')
plt.title('Fitted lines for different models')
plt.show()

Example of GPT-4 one-shot success

Training errors: [0.5941676382175596, 0.5933518294151625, 0.5129192853729939]
Test errors: [0.7419748200389997, 0.722912504049291, 0.9712589627709195]

GPT-4 is quite good at explaining the conceptual intent of code

GPT-4 explanation

Interactive debugging with GPT-4

  • 11/32 attempts required additional prompting to solve the problem
  • Common problems
    • Use of outdated package features
    • Type errors (e.g. accessing a list as if it were a dict; illustrated below)
    • Hallucinating non-existent packages or methods
    • Use of inappropriate methods for particular data
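
A hedged illustration of the list-vs-dict failure mode (hypothetical data, not taken from the actual experiments):

# GPT-4 sometimes writes records['name'], which raises
# TypeError: list indices must be integers or slices, not str
records = [{'name': 'rt_mean', 'value': 0.43}, {'name': 'accuracy', 'value': 0.91}]
print(records[0]['name'])  # correct: index the list first, then the dict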

An example of interactive debugging

self.continuous_model = LinearRegression()

An example of interactive debugging

ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 1

An example of interactive debugging

  • Initial comparison using in-sample train/test data
$ python hurdle4.py 
Mean squared error: 7.750558995547416
$ Rscript hurdle.R
[1] "Mean squared error: 108.965835640358"
  • After additional manual (human only) debugging:
$ python hurdle5.py 
Mean squared error: 5.198941931090807
$ Rscript hurdle_insample.R 
[1] "Mean squared error: 5.04963049849389"

LLMs inherit the stupidity of crowds

  • GPT-4 estimates the hurdle model in an incorrect way
    • All other Python implementations of the model on Github use this incorrect approach

GPT-4 coding failures

  • 9/32 attempts did not lead to a successful outcome
    • Number of additional prompts ranged from 1-4
    • One case was a system problem with a required library

Example: The EZ-diffusion model

The EZ-diffusion model for two-choice response time tasks takes mean response time, the variance of response time, and response accuracy as inputs. The model transforms these data via three simple equations to produce unique values for the quality of information, response conservativeness, and nondecision time. (Wagenmakers et al., 2007)
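
For reference, the three equations (following Wagenmakers et al., 2007; consult the paper for the authoritative forms), where \(P_c\) is accuracy, \(MRT\) and \(VRT\) are the mean and variance of response time, \(s\) is the scaling parameter, and \(L = \ln\left(\frac{P_c}{1 - P_c}\right)\):

\[ v = \mathrm{sign}\left(P_c - \tfrac{1}{2}\right) s \left[ \frac{L \left(P_c^2 L - P_c L + P_c - \tfrac{1}{2}\right)}{VRT} \right]^{1/4} \]
\[ a = \frac{s^2 L}{v} \]
\[ T_{er} = MRT - \frac{a}{2v} \cdot \frac{1 - e^{-va/s^2}}{1 + e^{-va/s^2}} \]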

Prompting GPT-4 for the EZ-diffusion model

GPT-4 misunderstanding

result = minimize(ez_diffusion_loss, initial_guess, bounds=bounds)

GPT-4 mathematical hallucination

def ez_diffusion(response_times, decisions):
    accuracy = np.mean(decisions > 0)
    rt_mean = np.mean(response_times)
    rt_var = np.var(response_times)
    
    v = np.sqrt(np.pi) * (4 * accuracy * (1 - accuracy) / (rt_var / rt_mean ** 2 - 1)) ** (1 / 4)
    a = rt_mean * (1 - 2 * accuracy) * v / (4 * accuracy * (1 - accuracy))
    z = a / 2

Correct equation

GPT-4 equation

Can multimodal GPT-4 with web access get it right?

Experiment 1: Summary

  • GPT-4 is often very successful at generating useful code
    • But it sometimes fails spectacularly, especially with math
  • GPT-4 does not provide any signals regarding relative confidence in the answer
    • The matter-of-fact nature of the answers tends to imply confidence
  • All answers need to be verified
    • “Care should be taken when using the outputs of GPT-4, particularly in contexts where reliability is important” (GPT-4 Technical Report)

Experiment 3: Generating code and tests

  • GPT-4 used to programmatically generate 20 coding problems in each of 5 domains (statistics/data science, physics, theoretical computer science, ecology, economics)
    • e.g. “Please generate 20 prompts to ask a chatbot to create Python code to solve a variety of physics problems.”

Example prompt

  • “Create a Python function to model the population growth of a species using the logistic growth equation, given the initial population size, carrying capacity, and growth rate. Please embed the code within an explicit code block, surrounded by triple-backtick markers. Generate realistic values for any examples, and do not use input() commands. Create code that is modular and well-commented. Then, generate a set of pytest tests that exercise each of the functions, embedded in a separate code block.” (Italicized text added)
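
A sketch of the kind of code-plus-tests response this prompt requests (written here for illustration; the function name, values, and tests are hypothetical rather than taken from GPT-4’s actual output):

import math
import pytest

def logistic_growth(initial_population, carrying_capacity, growth_rate, time):
    """Logistic growth: N(t) = K / (1 + ((K - N0) / N0) * exp(-r * t))."""
    n0, k, r = initial_population, carrying_capacity, growth_rate
    return k / (1 + ((k - n0) / n0) * math.exp(-r * time))

def test_starts_at_initial_population():
    assert logistic_growth(10, 100, 0.5, 0) == pytest.approx(10)

def test_approaches_carrying_capacity():
    assert logistic_growth(10, 100, 0.5, 100) == pytest.approx(100, rel=1e-3)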

GPT-4 coding results

  • GPT-4 successfully generated code and tests for each prompt
  • 97% of the resulting code executed successfully without errors

GPT-4 test generation results

  • Code coverage
    • Assessed using Coverage.py
  • 60% of tests had 100% line coverage

GPT-4 testing results

  • All tests successfully executed using pytest
    • All tests passed for 46 of the 100 problems
    • At least one test failed for 54
  • Primary causes of failure:
    • Failure of assertion (45)
    • Failure to properly raise error (11)

GPT-4 testing results

mass_jupiter = 1.8982e27
radius_jupiter = 6.9911e7
result = escape_velocity(mass_jupiter, radius_jupiter)
assert pytest.approx(result, rel=1e-3) == 59564.97

E       assert 60202.716344497014 ± 6.0e+01 == 59564.97
E         comparison failed
E         Obtained: 59564.97
E         Expected: 60202.716344497014 ± 6.0e+01
  • In this case, the code and test value are both correct, depending on where you stand on Jupiter!
    • NASA’s Jupiter fact sheet claims an escape velocity of 59.5 km/s, which seems to be the source of the test value
    • This is correct when computed using the equatorial radius of 71492 km
    • The value generated by the code (60.2 km/s) is correct when computed using the volumetric mean radius (69911 km), as the quick check below confirms
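
A quick check using \(v = \sqrt{2GM/R}\) (standard physics, not part of the original experiment):

import math

G = 6.674e-11              # gravitational constant, m^3 kg^-1 s^-2
mass_jupiter = 1.8982e27   # kg
for label, radius_m in [('equatorial radius (71492 km)', 7.1492e7),
                        ('volumetric mean radius (69911 km)', 6.9911e7)]:
    v = math.sqrt(2 * G * mass_jupiter / radius_m)  # escape velocity, m/s
    print(f'{label}: {v / 1000:.1f} km/s')
# prints roughly 59.5 km/s and 60.2 km/s, matching the test value and the code's output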

AI code review

  • Code review is essential for improving code quality
    • But many researchers do not have access to experts to help review their code
  • GPT-4 seems to be quite good at reviewing code
  • This could be a huge win for scientific code quality and coding education

AI code review: An example

  • Intentionally obfuscated Python code
from pandas import *
from numpy import *
from scipy.stats import *
maxD = 12
hc = ['Nervous', 'Hopeless', 'RestlessFidgety', 'Depressed', 'EverythingIsEffort', 'Worthless', ]
h=read_csv('https://raw.githubusercontent.com/poldrack/clean_coding/master/data/health.csv',index_col=0)[hc].dropna().mean(1)
data=read_csv('https://raw.githubusercontent.com/poldrack/clean_coding/master/data/meaningful_variables_clean.csv',index_col=0)
sc=[]
for i in range(data.shape[1]):
    if data.columns[i].split('.')[0][-7:] == '_survey':
        sc=sc+[data.columns[i]]
data=data[sc]
gs=[]
for i in range(data.shape[0]):
    if sum(isnan(data.iloc[i, :])) > 0:
        pass
    else:
        gs=gs+[i]
data=data.iloc[gs,:]
from sklearn.preprocessing import scale
data_sc = scale(data)
from sklearn.decomposition import FactorAnalysis
bicv=zeros(maxD)
for i in range(1,maxD+1):
    fa=FactorAnalysis(i)
    fa.fit(data_sc)
    bicv[i-1]=i*2 - 2*fa.score(data_sc)
npD=argmin(bicv)+1
fa=FactorAnalysis(npD)
f=fa.fit_transform(data_sc)
for i in range(npD):
    print(pearsonr(f[:,i],h[gs]))
    idx=argsort(abs(fa.components_[i, :]))[::-1]
    for j in range(3):
        print(data.columns[idx[j]], fa.components_[i, idx[j]])

https://github.com/poldrack/clean_coding

AI code review: An example

Can it find analytic errors?

  • Presented ChatGPT with a (GPT-4) generated Python script that performs cross-validated classification, with feature selection improperly performed outside of the cross-validation loop (a corrected version is sketched below)
selector = SelectKBest(score_func=f_classif, k=k)
X_selected = selector.fit_transform(X, y)

# Initialize k-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
accuracies = []

# Cross-validation loop
for train_index, test_index in kf.split(X_selected):
    X_train, X_test = X_selected[train_index], X_selected[test_index]
    y_train, y_test = y[train_index], y[test_index]
  • Asked it what is wrong with the code (in a separate chat session)
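
One way to correct the leakage (a hedged sketch, not taken from ChatGPT’s response; the dataset and classifier are placeholders) is to fit the feature selector only on the training folds, for example by placing it inside a Pipeline:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Placeholder data standing in for the original features and labels
X, y = make_classification(n_samples=200, n_features=50, random_state=42)

# SelectKBest is now fit within each training fold, so the held-out fold
# never influences which features are selected
pipeline = Pipeline([
    ('select', SelectKBest(score_func=f_classif, k=10)),
    ('clf', LogisticRegression(max_iter=1000)),
])
kf = KFold(n_splits=5, shuffle=True, random_state=42)
accuracies = cross_val_score(pipeline, X, y, cv=kf)
print(accuracies.mean())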

Multimodal coding assistance

AI-assisted coding workflow

  • ChatGPT with GPT-4 is particularly good at generating code de novo from a prompt
    • Except for math where it makes frequent errors
  • Github Copilot is optimal for IDE integration
  • Copilot Chat with GPT-4 now embedded within VSCode
    • Current code is placed into context

AI coding tools will have major implications for science and education in the coming years

How might AI coding tools improve science?

  • Higher quality of research code
    • Code review/refactoring
    • Test generation
    • Generation of simulated data
  • It could enable more effective training of researchers in coding
    • Assisted debugging could help improve debugging skills

How might AI coding tools harm science?

  • Lower quality of research code
    • Generation of complex code with lack of understanding by researcher
  • AI systems cannot be trusted to test themselves

AI-generated code and reproducibility

  • The results of GPT-4 interactions are largely irreproducible unless extra steps are taken
    • It is now possible to set the random seed for inference, but the model can change without warning
    • System fingerprints are available to detect model changes (see the sketch after the example below)
from openai import OpenAI

client = OpenAI()
client.chat.completions.create(
  model='gpt-4',
  seed=seed,  # e.g. None or 1234, as in the counts below
  messages=[
    {'role': 'system', 'content': 'You are a helpful assistant'},
    {'role': 'user', 'content': 'Output a random vegetable.'},
  ],
)

# Seed: None
Counter({'Broccoli': 73, 'Carrot': 13, 'Cabbage': 6, 'Cucumber': 4, 'Cauliflower': 3, 'Spinach': 1})

# Seed: 1234
Counter({'Cucumber': 100})
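
A sketch of how the system fingerprint might be recorded for provenance (assuming the current openai Python client; field names should be checked against the API documentation):

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
  model='gpt-4',
  seed=1234,
  messages=[{'role': 'user', 'content': 'Output a random vegetable.'}],
)
# Store the fingerprint alongside results so that later runs can detect
# whether the backend model has changed
print(response.choices[0].message.content)
print(response.system_fingerprint)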

Educational issues around AI-assisted coding

  • How do we assess understanding?
    • It will be increasingly easy for students to complete standard coding problems using AI tools.
  • Do we need to teach coding differently?
    • e.g. greater focus on debugging, testing, and code review
    • Formative assessments using LLM interactions

Finnie-Ansley et al., 2022

Educational issues around AI-generated code

  • We need to focus more on training students to work with AI tools (aka Programming 3.0)
    • What kinds of new interactions do students need to learn?

https://martinfowler.com/articles/2023-chatgpt-xu-hao.html

Teaching test-driven development

  • Starts by designing a test for the desired behavior(s)
    • Ensuring that the test fails on nonfunctional code
  • Then code is developed that can pass the test
  • Once the code passes, it is refactored to make it more robust
  • TDD can ensure that AI-generated code performs as advertised
    • But you have to know how to generate the appropriate tests! (a minimal sketch follows below)
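
A minimal sketch of the cycle with pytest (a hypothetical example, not from the talk): the tests encode the desired behavior and are written first, so they fail on a stub or on incorrect AI-generated code, and the implementation is accepted only once they pass.

import pytest

def celsius_to_kelvin(temp_c):
    """Convert Celsius to Kelvin, rejecting temperatures below absolute zero."""
    if temp_c < -273.15:
        raise ValueError('temperature below absolute zero')
    return temp_c + 273.15

# Written first: these define the desired behavior before the code exists
def test_celsius_to_kelvin_known_values():
    assert celsius_to_kelvin(0) == pytest.approx(273.15)
    assert celsius_to_kelvin(-273.15) == pytest.approx(0)

def test_celsius_to_kelvin_rejects_impossible_temperatures():
    with pytest.raises(ValueError):
        celsius_to_kelvin(-300)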

From prompt engineering to flow engineering

  • AlphaCodium (Ridnik et al., 2024)
    • Starts with a linear reasoning stage in natural language
    • Then moves to an iterative generate/run/fix process

Ridnik et al., 2024

From prompt engineering to flow engineering

  • Demonstrates several useful principles
    • LLMs reason better using bullet points (by forcing division into semantically distinct sections)
    • LLMs do better generating modular code
    • Better to generate multiple solutions and compare them
    • Iteratively generate new tests but continue to apply the validated tests

Ridnik et al., 2024

Conclusions

  • AI coding assistants can greatly improve programmer effectiveness
    • But they require care and verification
  • AI coding tools can be very useful for coding education
    • We need to focus more on training of logical analysis, testing, and workflow design

The Poldrack Lab

Collaborators

Gašper Beguš and Thomas Lu, UC Berkeley

How good is the code generated by GPT-4?

Clean coding and refactoring

“Clean code is simple and direct. Clean code reads like well-written prose.” - Grady Booch (from Martin, Clean Code)

  • Clean code uses consistent coding standards
  • Clean code minimizes complexity
  • Refactoring
    • Restructuring existing code to make it cleaner or more maintainable without changing its behavior

Experiment 2: Refactoring with GPT-4

  • Downloaded 2130 Python files from Github (via API)
    • Only one file per user
    • 272 remained after filtering to avoid autogenerated code and other issues
  • Prompt (via GPT-4 API): “Please refactor the following Python code to be more readable, adding or rewriting comments as needed.”
  • Code quality for original and refactored code assessed using flake8 and radon

With Gašper Beguš and Thomas Lu, UC Berkeley

Refactoring success

  • 100% of prompts produced code
  • 97% were free of syntax errors (vs 100% of original)

GPT-4 improves code formatting

  • flake8 linter identifies code formatting problems
  • Refactored code from GPT-4 produces many fewer warnings (per line of code)
    • 0.24 for original vs. 0.08 for GPT-4 (FDR p < .001)

GPT-4 reduces code complexity

Reduced # of logical code lines

Reduced cyclomatic complexity

Both FDR p < .01

Halstead metrics

  • Basic quantities:
    • \(\eta_1\): # distinct operators
    • \(\eta_2\): # distinct operands
    • vocabulary size: \(\eta = \eta_1 + \eta_2\)
    • \(N_1\): # total operators
    • \(N_2\): # total operands
    • program length: \(N = N_1 + N_2\)
  • Derived measures:
    • Volume: \(V = N \times \log_2 \eta\)

Maintainability index

  • Function of 4 other metrics:
    • Halstead volume (\(V\))
    • Cyclomatic complexity (\(CC\))
    • Source lines of code (\(L\))
    • Percent of comment lines (converted to radians) (\(PC\))

\[ MI = \max\left[0,\; 100 \times \frac{171 - 5.2 \ln(V) - 0.23\, CC - 16.2 \ln(L) + 50 \sin(\sqrt{2.4\, PC})}{171}\right] \]
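
A direct transcription of this formula into Python (a sketch for illustration; radon’s own implementation may differ in details such as how comments are counted):

import math

def maintainability_index(volume, cyclomatic_complexity, sloc, percent_comments_radians):
    """Compute MI from Halstead volume, cyclomatic complexity, source lines of
    code, and the percent of comment lines already converted to radians."""
    raw = (171
           - 5.2 * math.log(volume)
           - 0.23 * cyclomatic_complexity
           - 16.2 * math.log(sloc)
           + 50 * math.sin(math.sqrt(2.4 * percent_comments_radians)))
    return max(0, 100 * raw / 171)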