AI-assisted coding: Experiments with GPT-4

Russ Poldrack

Professor, Department of Psychology

Associate Director, Stanford Data Science

Stanford University

How many of you have used AI to help with coding?

Today’s punch lines

  • Large language models (LLMs) are powerful coding assistants, but have important shortcomings
    • If you are not using them to help write code, you are probably throwing away time
  • LLMs raise fundamental questions about how we teach coding and how we assess learning of coding skills

https://arxiv.org/abs/2304.13187

Large language models

https://jalammar.github.io/how-gpt3-works-visualizations-animations/

Vaswani et al., 2017

OpenAI Codex

  • Released August 2021
  • Based on GPT-3
    • Trained on a large body of publicly available source code including 150+ GB of Python code from 54M public GitHub repositories
  • “Our results show that Codex performs better than most students on code writing questions in typical first year programming exams” (Finnie-Ansley et al., 2022)
  • Deprecated in March 2023 in favor of GPT-3.5

Github Copilot

GPT-4 codes at human level (Bubeck et al. 2023)

  • AI systems were given problems from the LeetCode platform for coding practice/testing
    • only problems posted after the GPT-4 training period ended (Sept 2021)
  • GPT-4 performs on par with human LeetCode users

Incremental improvements on GPT-4

  • Since early 2023 improvements have been incremental
    • Most have been based on GPT-4 with improved prompting techniques or additional training data
  • GPT-4 still dominates but open-source models are catching up

HumanEval: 164 coding problems with 8 tests each https://paperswithcode.com/sota/code-generation-on-humaneval

Why coding is an optimal use case for LLMs

  • LLMs can sometimes “hallucinate”

  • LLMs are most useful in cases where work is difficult to generate but relatively easy to verify (a minimal sketch follows below)
    • It’s trivially easy to test whether code is syntactically correct
    • It’s usually straightforward to generate automated tests to determine whether code is functioning correctly
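
A minimal sketch of what “easy to verify” means in practice (an illustrative example; the function and test are hypothetical, not from the experiments described later):

import ast

generated_code = "def add(a, b):\n    return a + b\n"

# Syntax check: ast.parse raises SyntaxError if the code is not valid Python
ast.parse(generated_code)

# Behavioral check: execute the code and run a simple automated test
namespace = {}
exec(generated_code, namespace)
assert namespace["add"](2, 3) == 5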

Experiments with GPT-4

  • Experiments performed in March 2023
    • Before multimodal support
    • Before supposed degradation of GPT-4 performance in mid-2023 (Chen et al., 2023)

Experiment 1: Interactive coding with GPT-4

  • Presented GPT-4 (via ChatGPT Plus) with 32 coding problems
    • Primarily focused on data science/statistics/ML and text mining
  • Caveats:
    • User was fairly novice with prompt engineering
    • Assessment was completely subjective and qualitative

GPT-4 coding performance

  • Median 69 lines of code (range 31-161)
  • “Success” defined as running with no errors and giving a result that was reasonably close to the intended result (by my subjective judgment)
  • “NS” meant that a relatively novice user (i.e. me) was not able to easily prompt a successful solution (i.e. within a few minutes of effort)

Example of GPT-4 one-shot success

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline

# 1. Generate synthetic data
np.random.seed(42)
mean = [0, 0]
cov = [[1, 0.5], [0.5, 1]]
train_data = np.random.multivariate_normal(mean, cov, 32)
test_data = np.random.multivariate_normal(mean, cov, 32)

X_train = train_data[:, 0].reshape(-1, 1)
y_train = train_data[:, 1]

X_test = test_data[:, 0].reshape(-1, 1)
y_test = test_data[:, 1]

# 2. Fit three models
linear_model = LinearRegression().fit(X_train, y_train)
second_order_model = make_pipeline(PolynomialFeatures(2), LinearRegression()).fit(X_train, y_train)
ninth_order_model = make_pipeline(PolynomialFeatures(9), LinearRegression()).fit(X_train, y_train)

# 3. Compute errors
models = [linear_model, second_order_model, ninth_order_model]
train_errors = [mean_squared_error(y_train, model.predict(X_train)) for model in models]
test_errors = [mean_squared_error(y_test, model.predict(X_test)) for model in models]

print("Training errors:", train_errors)
print("Test errors:", test_errors)

# 4. Plot fitted lines
plt.scatter(X_train, y_train, color='blue', label='Training data')

X_line = np.linspace(X_train.min(), X_train.max(), 100).reshape(-1, 1)
colors = ['green', 'red', 'purple']
labels = ['Linear', '2nd-order', '9th-order']

for i, model in enumerate(models):
    plt.plot(X_line, model.predict(X_line), color=colors[i], label=labels[i])

plt.legend()
plt.xlabel('Variable 1')
plt.ylabel('Variable 2')
plt.title('Fitted lines for different models')
plt.show()

Example of GPT-4 one-shot success

Training errors: [0.5941676382175596, 0.5933518294151625, 0.5129192853729939]
Test errors: [0.7419748200389997, 0.722912504049291, 0.9712589627709195]

GPT-4 is quite good at explaining the conceptual intent of code

GPT-4 explanation

Interactive debugging with GPT-4

  • 11/32 attempts required additional prompting to solve the problem
  • Common problems
    • Use of outdated package features
    • Type errors (e.g. accessing a list as if it were a dict; illustrated below)
    • Hallucinating non-existent packages or methods
    • Use of inappropriate methods for particular data
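
A hedged illustration of the list-vs-dict failure mode (hypothetical data, not taken from the actual experiments):

# GPT-4 sometimes writes records['name'], which raises
# TypeError: list indices must be integers or slices, not str
records = [{'name': 'rt_mean', 'value': 0.43}, {'name': 'accuracy', 'value': 0.91}]
print(records[0]['name'])  # correct: index the list first, then the dict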

An example of interactive debugging

self.continuous_model = LinearRegression()

An example of interactive debugging

ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 1

An example of interactive debugging

  • Initial comparison using in-sample train/test data
$ python hurdle4.py 
Mean squared error: 7.750558995547416
$ Rscript hurdle.R
[1] "Mean squared error: 108.965835640358"
  • After additional manual (human only) debugging:
$ python hurdle5.py 
Mean squared error: 5.198941931090807
$ Rscript hurdle_insample.R 
[1] "Mean squared error: 5.04963049849389"

LLMs inherit the stupidity of crowds

  • GPT-4 estimates the hurdle model in an incorrect way
    • All other Python implementations of the model on Github use this incorrect approach

GPT-4 coding failures

  • 9/32 attempts did not lead to a successful outcome
    • Number of additional prompts ranged from 1-4
    • One case was a system problem with a required library

Example: The EZ-diffusion model

The EZ-diffusion model for two-choice response time tasks takes mean response time, the variance of response time, and response accuracy as inputs. The model transforms these data via three simple equations to produce unique values for the quality of information, response conservativeness, and nondecision time. (Wagenmakers et al., 2007)
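
For reference, the three equations (following Wagenmakers et al., 2007; consult the paper for the authoritative forms), where \(P_c\) is accuracy, \(MRT\) and \(VRT\) are the mean and variance of response time, \(s\) is the scaling parameter, and \(L = \ln\left(\frac{P_c}{1 - P_c}\right)\):

\[ v = \mathrm{sign}\left(P_c - \tfrac{1}{2}\right) s \left[ \frac{L \left(P_c^2 L - P_c L + P_c - \tfrac{1}{2}\right)}{VRT} \right]^{1/4} \]
\[ a = \frac{s^2 L}{v} \]
\[ T_{er} = MRT - \frac{a}{2v} \cdot \frac{1 - e^{-va/s^2}}{1 + e^{-va/s^2}} \]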

Prompting GPT-4 for the EZ-diffusion model

GPT-4 misunderstanding

result = minimize(ez_diffusion_loss, initial_guess, bounds=bounds)

GPT-4 mathematical hallucination

def ez_diffusion(response_times, decisions):
    accuracy = np.mean(decisions > 0)
    rt_mean = np.mean(response_times)
    rt_var = np.var(response_times)
    
    v = np.sqrt(np.pi) * (4 * accuracy * (1 - accuracy) / (rt_var / rt_mean ** 2 - 1)) ** (1 / 4)
    a = rt_mean * (1 - 2 * accuracy) * v / (4 * accuracy * (1 - accuracy))
    z = a / 2

Correct equation

GPT-4 equation

Can multimodal GPT-4 with web access get it right?

Experiment 1: Summary

  • GPT-4 is often very successful at generating useful code
    • But it sometimes fails spectacularly, especially with math
  • GPT-4 does not provide any signals regarding relative confidence in the answer
    • The matter-of-fact nature of the answers tends to imply confidence
  • All answers need to be verified
    • “Care should be taken when using the outputs of GPT-4, particularly in contexts where reliability is important” (GPT-4 Technical Report)

Experiment 3: Generating code and tests

  • GPT-4 used to programmatically generate 20 coding problems in each of 5 domains (statistics/data science, physics, theoretical computer science, ecology, economics)
    • e.g. “Please generate 20 prompts to ask a chatbot to create Python code to solve a variety of physics problems.”

Example prompt

  • “Create a Python function to model the population growth of a species using the logistic growth equation, given the initial population size, carrying capacity, and growth rate. Please embed the code within an explicit code block, surrounded by triple-backtick markers. Generate realistic values for any examples, and do not use input() commands. Create code that is modular and well-commented. Then, generate a set of pytest tests that exercise each of the functions, embedded in a separate code block.” (Italicized text added)
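
A sketch of the kind of code-plus-tests response this prompt requests (written here for illustration; the function name, values, and tests are hypothetical rather than taken from GPT-4’s actual output):

import math
import pytest

def logistic_growth(initial_population, carrying_capacity, growth_rate, time):
    """Logistic growth: N(t) = K / (1 + ((K - N0) / N0) * exp(-r * t))."""
    n0, k, r = initial_population, carrying_capacity, growth_rate
    return k / (1 + ((k - n0) / n0) * math.exp(-r * time))

def test_starts_at_initial_population():
    assert logistic_growth(10, 100, 0.5, 0) == pytest.approx(10)

def test_approaches_carrying_capacity():
    assert logistic_growth(10, 100, 0.5, 100) == pytest.approx(100, rel=1e-3)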

GPT-4 coding results

  • GPT-4 successfully generated code and tests for each prompt
  • 97% of the resulting code executed successfully without errors

GPT-4 test generation results

  • Code coverage
    • Assessed using Coverage.py
  • 60% of tests had 100% line coverage

GPT-4 testing results

  • All tests successfully executed using pytest
    • All tests passed for 46 of the 100 problems
    • At least one test failed for 54
  • Primary causes of failure:
    • Failure of assertion (45)
    • Failure to properly raise error (11)

GPT-4 testing results

mass_jupiter = 1.8982e27
radius_jupiter = 6.9911e7
result = escape_velocity(mass_jupiter, radius_jupiter)
assert pytest.approx(result, rel=1e-3) == 59564.97

E       assert 60202.716344497014 ± 6.0e+01 == 59564.97
E         comparison failed
E         Obtained: 59564.97
E         Expected: 60202.716344497014 ± 6.0e+01
  • In this case, the code and test value are both correct, depending on where you stand on Jupiter!
    • NASA’s Jupiter fact sheet claims an escape velocity of 59.5 km/s, which seems to be the source of the test value
    • This is correct when computed using the equatorial radius of 71492 km
    • The value generated by the code (60.2 km/s) is correct when computed using the volumetric mean radius (69911 km), as the quick check below confirms
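
A quick check using \(v = \sqrt{2GM/R}\) (standard physics, not part of the original experiment):

import math

G = 6.674e-11              # gravitational constant, m^3 kg^-1 s^-2
mass_jupiter = 1.8982e27   # kg
for label, radius_m in [('equatorial radius (71492 km)', 7.1492e7),
                        ('volumetric mean radius (69911 km)', 6.9911e7)]:
    v = math.sqrt(2 * G * mass_jupiter / radius_m)  # escape velocity, m/s
    print(f'{label}: {v / 1000:.1f} km/s')
# prints roughly 59.5 km/s and 60.2 km/s, matching the test value and the code's output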

AI code review

  • Code review is essential for improving code quality
    • But many researchers do not have access to experts to help review their code
  • GPT-4 seems to be quite good at reviewing code
  • This could be a huge win for scientific code quality and coding education

AI code review: An example

  • Intentionally obfuscated Python code
from pandas import *
from numpy import *
from scipy.stats import *
maxD = 12
hc = ['Nervous', 'Hopeless', 'RestlessFidgety', 'Depressed', 'EverythingIsEffort', 'Worthless', ]
h=read_csv('https://raw.githubusercontent.com/poldrack/clean_coding/master/data/health.csv',index_col=0)[hc].dropna().mean(1)
data=read_csv('https://raw.githubusercontent.com/poldrack/clean_coding/master/data/meaningful_variables_clean.csv',index_col=0)
sc=[]
for i in range(data.shape[1]):
    if data.columns[i].split('.')[0][-7:] == '_survey':
        sc=sc+[data.columns[i]]
data=data[sc]
gs=[]
for i in range(data.shape[0]):
    if sum(isnan(data.iloc[i, :])) > 0:
        pass
    else:
        gs=gs+[i]
data=data.iloc[gs,:]
from sklearn.preprocessing import scale
data_sc = scale(data)
from sklearn.decomposition import FactorAnalysis
bicv=zeros(maxD)
for i in range(1,maxD+1):
    fa=FactorAnalysis(i)
    fa.fit(data_sc)
    bicv[i-1]=i*2 - 2*fa.score(data_sc)
npD=argmin(bicv)+1
fa=FactorAnalysis(npD)
f=fa.fit_transform(data_sc)
for i in range(npD):
    print(pearsonr(f[:,i],h[gs]))
    idx=argsort(abs(fa.components_[i, :]))[::-1]
    for j in range(3):
        print(data.columns[idx[j]], fa.components_[i, idx[j]])

https://github.com/poldrack/clean_coding

AI code review: An example

Can it find analytic errors?

  • Presented ChatGPT with a (GPT-4) generated Python script that performs cross-validated classification, with feature selection improperly performed outside of the cross-validation loop (a corrected version is sketched below)
selector = SelectKBest(score_func=f_classif, k=k)
X_selected = selector.fit_transform(X, y)

# Initialize k-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
accuracies = []

# Cross-validation loop
for train_index, test_index in kf.split(X_selected):
    X_train, X_test = X_selected[train_index], X_selected[test_index]
    y_train, y_test = y[train_index], y[test_index]
  • Asked it what is wrong with the code (in a separate chat session)
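
One way to correct the leakage (a hedged sketch, not taken from ChatGPT’s response; the dataset and classifier are placeholders) is to fit the feature selector only on the training folds, for example by placing it inside a Pipeline:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Placeholder data standing in for the original features and labels
X, y = make_classification(n_samples=200, n_features=50, random_state=42)

# SelectKBest is now fit within each training fold, so the held-out fold
# never influences which features are selected
pipeline = Pipeline([
    ('select', SelectKBest(score_func=f_classif, k=10)),
    ('clf', LogisticRegression(max_iter=1000)),
])
kf = KFold(n_splits=5, shuffle=True, random_state=42)
accuracies = cross_val_score(pipeline, X, y, cv=kf)
print(accuracies.mean())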

Multimodal coding assistance

AI-assisted coding workflow

  • ChatGPT with GPT-4 is particularly good at generating code de novo from a prompt
    • Except for math where it makes frequent errors
  • Github Copilot is optimal for IDE integration
  • Copilot Chat with GPT-4 now embedded within VSCode
    • Current code is placed into context

AI coding tools will have major implications for science and education in the coming years

How might AI coding tools improve science?

  • Higher quality of research code
    • Code review/refactoring
    • Test generation
    • Generation of simulated data
  • It could enable more effective training of researchers in coding
    • Assisted debugging could help improve debugging skills

How might AI coding tools harm science?

  • Lower quality of research code
    • Generation of complex code with lack of understanding by researcher
  • AI systems cannot be trusted to test themselves

AI-generated code and reproducibility

  • The results of GPT-4 interactions are largely irreproducible unless extra steps are taken
    • It is now possible to set the random seed for inference, but the model can change without warning
    • System fingerprints are available to detect model changes (see the sketch after the example below)
from openai import OpenAI

client = OpenAI()
client.chat.completions.create(
  model='gpt-4',
  seed=seed,  # e.g. None or 1234, as in the counts below
  messages=[
    {'role': 'system', 'content': 'You are a helpful assistant'},
    {'role': 'user', 'content': 'Output a random vegetable.'},
  ],
)

# Seed: None
Counter({'Broccoli': 73, 'Carrot': 13, 'Cabbage': 6, 'Cucumber': 4, 'Cauliflower': 3, 'Spinach': 1})

# Seed: 1234
Counter({'Cucumber': 100})
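
A sketch of how the system fingerprint might be recorded for provenance (assuming the current openai Python client; field names should be checked against the API documentation):

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
  model='gpt-4',
  seed=1234,
  messages=[{'role': 'user', 'content': 'Output a random vegetable.'}],
)
# Store the fingerprint alongside results so that later runs can detect
# whether the backend model has changed
print(response.choices[0].message.content)
print(response.system_fingerprint)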

Educational issues around AI-assisted coding

  • How do we assess understanding?
    • It will be increasingly easy for students to complete standard coding problems using AI tools.
  • Do we need to teach coding differently?
    • e.g. greater focus on debugging, testing, and code review
    • Formative assessments using LLM interactions

Finnie-Ansley et al., 2022

Educational issues around AI-generated code

  • We need to focus more on training students to work with AI tools (aka Programming 3.0)
    • What kinds of new interactions do students need to learn?

https://martinfowler.com/articles/2023-chatgpt-xu-hao.html

Teaching test-driven development

  • Starts by designing a test for the desired behavior(s)
    • Ensuring that the test fails on nonfunctional code
  • Then code is developed that can pass the test
  • Once the code passes, it is refactored to make it more robust
  • TDD can ensure that AI-generated code performs as advertised
    • But you have to know how to generate the appropriate tests! (a minimal sketch follows below)
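
A minimal sketch of the cycle with pytest (a hypothetical example, not from the talk): the tests encode the desired behavior and are written first, so they fail on a stub or on incorrect AI-generated code, and the implementation is accepted only once they pass.

import pytest

def celsius_to_kelvin(temp_c):
    """Convert Celsius to Kelvin, rejecting temperatures below absolute zero."""
    if temp_c < -273.15:
        raise ValueError('temperature below absolute zero')
    return temp_c + 273.15

# Written first: these define the desired behavior before the code exists
def test_celsius_to_kelvin_known_values():
    assert celsius_to_kelvin(0) == pytest.approx(273.15)
    assert celsius_to_kelvin(-273.15) == pytest.approx(0)

def test_celsius_to_kelvin_rejects_impossible_temperatures():
    with pytest.raises(ValueError):
        celsius_to_kelvin(-300)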

From prompt engineering to flow engineering

  • AlphaCodium (Ridnik et al., 2024)
    • Starts with a linear reasoning stage in natural language
    • Then moves to an iterative generate/run/fix process

Ridnik et al., 2024

From prompt engineering to flow engineering

  • Demonstrates several useful principles
    • LLMs reason better using bullet points (by forcing division into semantically distinct sections)
    • LLMs do better generating modular code
    • Better to generate multiple solutions and compare them
    • Iteratively generate new tests but continue to apply the validated tests

Ridnik et al., 2024

Conclusions

  • AI coding assistants can greatly improve programmer effectiveness
    • But they require care and verification
  • AI coding tools can be very useful for coding education
    • We need to focus more on training of logical analysis, testing, and workflow design

The Poldrack Lab

Collaborators

Gašper Beguš and Thomas Lu, UC Berkeley

How good is the code generated by GPT-4?

Clean coding and refactoring

“Clean code is simple and direct. Clean code reads like well-written prose.” - Grady Booch (from Martin, Clean Code)

  • Clean code uses consistent coding standards
  • Clean code minimizes complexity
  • Refactoring
    • Restructuring existing code to make it cleaner or more maintainable without changing its behavior

Experiment 2: Refactoring with GPT-4

  • Downloaded 2130 Python files from Github (via API)
    • Only one file per user
    • 272 remained after filtering to avoid autogenerated code and other issues
  • Prompt (via GPT-4 API): “Please refactor the following Python code to be more readable, adding or rewriting comments as needed.”
  • Code quality for original and refactored code assessed using flake8 and radon

With Gašper Beguš and Thomas Lu, UC Berkeley

Refactoring success

  • 100% of prompts produced code
  • 97% were free of syntax errors (vs 100% of original)

GPT-4 improves code formatting

  • flake8 linter identifies code formatting problems
  • Refactored code from GPT-4 produces many fewer warnings (per line of code)
    • 0.24 for original vs. 0.08 for GPT-4 (FDR p < .001)

GPT-4 reduces code complexity

Reduced # of logical code lines

Reduced cyclomatic complexity

Both FDR p < .01

Halstead metrics

  • Basic quantities:
    • \(\eta_1\): # distinct operators
    • \(\eta_2\): # distinct operands
    • vocabulary size: \(\eta = \eta_1 + \eta_2\)
    • \(N_1\): # total operators
    • \(N_2\): # total operands
    • program length: \(N = N_1 + N_2\)
  • Derived measures:
    • Volume: \(V = N \times \log_2 \eta\)

Maintainability index

  • Function of 4 other metrics:
    • Halstead volume (\(V\))
    • Cyclomatic complexity (\(CC\))
    • Source lines of code (\(L\))
    • Percent of comment lines (converted to radians) (\(PC\))

\[ MI = \max\left[0,\; 100 \times \frac{171 - 5.2 \ln(V) - 0.23\, CC - 16.2 \ln(L) + 50 \sin(\sqrt{2.4\, PC})}{171}\right] \]
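
A direct transcription of this formula into Python (a sketch for illustration; radon’s own implementation may differ in details such as how comments are counted):

import math

def maintainability_index(volume, cyclomatic_complexity, sloc, percent_comments_radians):
    """Compute MI from Halstead volume, cyclomatic complexity, source lines of
    code, and the percent of comment lines already converted to radians."""
    raw = (171
           - 5.2 * math.log(volume)
           - 0.23 * cyclomatic_complexity
           - 16.2 * math.log(sloc)
           + 50 * math.sin(math.sqrt(2.4 * percent_comments_radians)))
    return max(0, 100 * raw / 171)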