What Have Language Models Learned?

By asking language models to fill in the blank, we can probe their understanding of the world.

Large language models are making it possible for computers to write stories, program a website and turn captions into images.

One of the first of these models, BERT, is trained by taking sentences, splitting them into individual words, randomly hiding some of them, and predicting what the hidden words are. After doing this millions of times, BERT has “read” enough Shakespeare to predict how this phrase usually ends:

To be or not to be, that is the ___.

question 56.987%, difference 3.610%, answer 3.004%, problem 2.691%, key 2.623%, challenge 0.954%, truth 0.899%, game 0.743%, point 0.719%, definition 0.678%, …

BERT's top predictions for what should fill in the hidden word

This page is hooked up to a version of BERT trained on Wikipedia and books.¹ Try clicking on different words to see how they’d be filled in, or typing in another sentence to see what else BERT has picked up on.
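
If you’d like to poke at the model outside this page, the same kind of probing can be reproduced in a few lines of code. Below is a minimal sketch using the Hugging Face transformers library and the checkpoint named in footnote 1; the probe sentence is just an example.

```python
# Minimal sketch: ask BERT which words are most likely to fill a hidden word.
# Uses the same Hugging Face checkpoint referenced in footnote 1.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-large-uncased-whole-word-masking")

# [MASK] marks the hidden word; top_k controls how many predictions to return.
for pred in fill("To be or not to be, that is the [MASK].", top_k=5):
    print(f"{pred['score']:.3f}  {pred['token_str']}")
```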

Cattle or Clothes?

Besides Hamlet’s existential dread, the text BERT was trained on also contains more patterns:

things 7.299%, beer 4.786%, it 3.883%, horses 2.917%, them 2.787%, coffee 2.103%, everything 2.000%, stuff 1.799%, cattle 1.758%, land 1.343%, …

BERT's top predictions for what gets bought in the Texas sentence

Cattle and horses aren’t top purchase predictions in every state, though! In New York, some of the most likely words are clothes, books and art:

things 20.794%, clothes 4.351%, it 4.147%, books 3.487%, them 2.935%, everything 2.885%, stuff 2.700%, more 1.343%, art 1.234%, out 1.151%, …

BERT's top predictions for the New York sentence

There are more than 30,000 words, punctuation marks and word fragments in BERT’s vocabulary. Every time BERT fills in a hidden word, it assigns each of them a probability. By looking at how slightly different sentences shift those probabilities, we can get a glimpse of what BERT has picked up about purchasing patterns in different places.
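
The page’s own chart code is linked in footnote 1; as a rough sketch of the underlying idea, here is one way to score every vocabulary token for two nearly identical sentences and see which tokens shift the most. The probe sentences below are made-up stand-ins for the editable ones on the page.

```python
# Rough sketch: compare BERT's full mask-fill distributions for two sentences.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "bert-large-uncased-whole-word-masking"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name).eval()

def mask_probs(sentence):
    """Probability of every vocabulary token at the [MASK] position."""
    inputs = tok(sentence, return_tensors="pt")
    mask_idx = (inputs["input_ids"][0] == tok.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_idx]
    return torch.softmax(logits, dim=-1)

# Hypothetical probe sentences -- the ones on the page are editable.
texas = mask_probs("In Texas, they like to buy [MASK].")
new_york = mask_probs("In New York, they like to buy [MASK].")

# Tokens whose likelihood shifts most toward the Texas sentence.
shift = texas.log() - new_york.log()
for idx in shift.topk(10).indices:
    print(tok.convert_ids_to_tokens(idx.item()), round(shift[idx].item(), 2))
```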

⚠️Some of the text this model was trained on includes harmful stereotypes. This is a tool to uncover these associations—not an endorsement of them.

[Interactive chart: tokens plotted by their likelihood in the New York sentence against their likelihood in the Texas sentence. BERT associates purchases like cattle and horses more with Texas, and clothes, books and art more with New York.]

You can edit these sentences. Or try one of these comparisons to get started: Ireland v. Australia, Arctic v. Equator, Coast v. Plains, Narnia v. Gotham, Supermarket v. Mall.

To the extent that a computer program can “know” something, what does BERT know about where you live?

What’s in a Name?

This technique can also probe what associations BERT has learned about different groups of people. For example, it predicts people named Elsie are older than people named Lauren:

[Interactive chart: year tokens plotted by their likelihood in the Elsie sentence against their likelihood in the Lauren sentence; earlier years lean toward Elsie and recent years toward Lauren.]

It’s also learned that people named Jim have more typically masculine jobs than people named Jane:

[Interactive chart: occupation tokens plotted by their likelihood in the Jim sentence against their likelihood in the Jane sentence.]

These aren’t just spurious correlations — Elsies really are more likely to be older than Laurens.² And occupations the model associates with feminine names are held by a higher percentage of women.

Should we be concerned about these correlations? BERT was trained to fill in blanks in Wikipedia articles and books — it does a great job at that! The problem is that the internal representations of language these models have learned are used for much more – by some measures, they’re the best way we have of getting computers to understand and manipulate text.

We wouldn’t hesitate to call a conversation partner or recruiter sexist if they blithely assumed that doctors are men, but that’s exactly what BERT might do if heedlessly incorporated into a chatbot or HR software:

[Interactive chart: name and pronoun tokens plotted by their likelihood in the nurse sentence against their likelihood in the doctor sentence.]

Adjusting for assumptions like this isn’t trivial. Why machine learning systems produce a given output still isn’t well understood – determining whether a credit model built on top of BERT rejected a loan application because of gender discrimination might be quite difficult.

Deploying large language models at scale also risks amplifying and perpetuating today’s harmful stereotypes. When prompted with “Two Muslims walked into a…”, for example, GPT-3 typically finishes the sentence with descriptions of violence.

How Can We Fix This?

One conceptually straightforward approach: remove unwanted correlations from the training data to mitigate model bias.

Last year a version of BERT called Zari was trained with an additional set of generated sentences. For every sentence with a gendered noun, like boy or aunt, another sentence that replaced the noun with its gender-partner was added to the training data: in addition to “The lady doth protest too much,” Zari was also trained on “The gentleman doth protest too much.”
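
A toy sketch of this kind of counterfactual augmentation is below; the word list and sentence are a tiny illustrative subset, not the actual data used to train Zari.

```python
# Toy sketch of counterfactual data augmentation: for every sentence containing a
# gendered noun, also add a copy with the noun swapped for its gender partner.
GENDER_PARTNERS = {
    "boy": "girl", "girl": "boy",
    "aunt": "uncle", "uncle": "aunt",
    "lady": "gentleman", "gentleman": "lady",
    "he": "she", "she": "he",
}

def augment(sentences):
    """Yield each original sentence, plus a swapped copy when applicable."""
    for sentence in sentences:
        yield sentence
        words = sentence.split()
        if any(w in GENDER_PARTNERS for w in words):
            yield " ".join(GENDER_PARTNERS.get(w, w) for w in words)

print(list(augment(["the lady doth protest too much"])))
# ['the lady doth protest too much', 'the gentleman doth protest too much']
```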

[Interactive chart: Zari's predictions plotted by their likelihood in the doctor sentence against their likelihood in the nurse sentence.]

After being trained on the swapped sentences, Zari, unlike BERT, assigns nurses and doctors an equal probability of being a “she” or a “he”. This approach hasn’t removed all the gender correlations; because names weren’t swapped, Zari’s association between masculine names and doctors has only slightly decreased from BERT’s.³ And the retraining doesn’t change how the model understands nonbinary gender.

Something similar happened with other attempts to remove gender bias from models’ representations of words. It’s possible to mathematically define bias and perform “brain surgery” on a model to remove it, but language is steeped in gender. Large models can have billions of parameters in which to learn stereotypes — slightly different measures of bias have found that the retrained models only shifted the stereotypes around, making them undetectable by the initial measure.
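
The “brain surgery” referenced above is, roughly, the idea of defining a gender direction in an embedding space and projecting it out of other word vectors. Here is a toy sketch of that projection (the approach being critiqued here), using made-up vectors rather than real model weights.

```python
# Toy sketch of projecting a gender direction out of word vectors.
# The embeddings here are random stand-ins, not real model weights.
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=8) for w in ["he", "she", "doctor", "nurse"]}

gender_dir = emb["he"] - emb["she"]
gender_dir /= np.linalg.norm(gender_dir)

def debias(vec, direction):
    """Remove the component of `vec` that lies along `direction`."""
    return vec - np.dot(vec, direction) * direction

for word in ["doctor", "nurse"]:
    before = np.dot(emb[word], gender_dir)
    after = np.dot(debias(emb[word], gender_dir), gender_dir)
    print(word, round(before, 3), round(after, 3))  # "after" is ~0 by construction
```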

As with other applications of machine learning, it’s helpful to focus instead on the actual harms that could occur. Tools like AllenNLP, LMdiff and the Language Interpretability Tool make it easier to interact with language models to find where they might be falling short. Once those shortcomings are spotted, task-specific mitigation measures can be simpler to apply than modifying the entire model.

It’s also possible that as models grow more capable, they might be able to explain and perform some of this debiasing themselves. Instead of forcing the model to tell us the gender of “the doctor,” we could let it respond with uncertainty that’s shown to the user, along with controls to override its assumptions.

Credits

Adam Pearce // July 2021

Thanks to Ben Wedin, Emily Reif, James Wexler, Fernanda Viégas, Ian Tenney, Kellie Webster, Kevin Robinson, Lucas Dixon, Ludovic Peran, Martin Wattenberg, Michael Terry, Tolga Bolukbasi, Vinodkumar Prabhakaran, Xuezhi Wang, Yannick Assogba, and Zan Armstrong for their help with this piece.

Footnotes

¹ The BERT model used on this page is the Hugging Face version of bert-large-uncased-whole-word-masking. “BERT” also refers to a type of model architecture; hundreds of BERT models have been trained and published. The model and chart code used here are available on GitHub.

² Notice that “1800”, “1900” and “2000” are some of the top predictions, though. People aren’t actually more likely to be born at the start of a century, but in BERT’s training corpus of books and Wikipedia articles round numbers are more common.

³ Comparing BERT and Zari in this interface requires carefully tracking tokens during a transition. The BERT Difference Plots colab has ideas for extensions to systematically look at differences between the models’ output.

This analysis shouldn’t stop once a model is deployed — as language and model usage shifts, it’s important to continue studying and mitigating potential harms.

Appendix: Differences Over Time

In addition to looking at how predictions for men and women are different for a given sentence, we can also chart how those differences have changed over time:

In $year, he worked as a ___.
In $year, she worked as a ___.
[Chart: occupation completions over time for the he/she sentences.]
In $year, he studied ___.
In $year, she studied ___.
[Chart: field-of-study completions over time for the he/she sentences.]
Born in $year, his name was ___.
Born in $year, her name was ___.
[Chart: name completions over time for the he/she sentences.]
The top 150 “he” and “she” completions in years from 1860-2018 are shown with the y position encoding he_logit - she_logit. Run in Colab →

The convergence in more recent years suggests another potential mitigation technique: using a prefix to steer the model away from unwanted correlations while preserving its understanding of natural language.

Using “In $year” as the prefix is quite limited, though, as it doesn’t handle gender-neutral pronouns and potentially increases other correlations. However, it may be possible to find a better prefix that mitigates a specific type of bias with just a couple of dozen examples.
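
As a rough, self-contained illustration of the prefix idea (the probe sentences are made up, and this only checks a single pronoun pair):

```python
# Rough sketch: does adding a year prefix narrow the he/she gap for a masked pronoun?
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "bert-large-uncased-whole-word-masking"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name).eval()

def he_she_ratio(sentence):
    """P(he) / P(she) at the [MASK] position."""
    inputs = tok(sentence, return_tensors="pt")
    mask_idx = (inputs["input_ids"][0] == tok.mask_token_id).nonzero().item()
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits[0, mask_idx], dim=-1)
    he = probs[tok.convert_tokens_to_ids("he")].item()
    she = probs[tok.convert_tokens_to_ids("she")].item()
    return he / she

print(he_she_ratio("[MASK] worked as a doctor."))
print(he_she_ratio("In 2018, [MASK] worked as a doctor."))
```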

In $year, they worked as a ___.
In $year, she worked as a ___.
[Chart: occupation and group completions over time for the they/she sentences.]
In $year, he ___ a bear.
In $year, she ___ a bear.
[Chart: verb completions over time for the he/she bear sentences.]
In $year, he played a game of ___.
In $year, she played a game of ___.
[Chart: game completions over time for the he/she sentences.]

Closer examination of these differences in differences also shows there’s a limit to the facts we can pull out of BERT this way.

Below, the top row of charts shows how predicted differences in occupations between men and women change between 1908 and 2018. The rightmost chart shows the he/she difference in 1908 against the he/she difference in 2018.

The flat slope of the rightmost chart indicates that the he/she difference has decreased for each job by about the same amount. But in reality, shifts in occupation weren’t nearly so smooth and some occupations, like accounting, switched from being majority male to majority female.

This reality-prediction mismatch could be caused by lack of training data, model size or the coarseness of the probing method. There’s an immense amount of general knowledge inside of these models — with a little bit of focused training, they can even become expert trivia players.

More Explorables

Hidden Bias

Models trained on real-world data can encode real-world bias. Hiding information about protected classes doesn't always fix things — sometimes it can even hurt.

Are Model Predictions Probabilities?

Machine learning models express their uncertainty as model scores, but through calibration we can transform these scores into probabilities for more effective decision making.

Can Large Language Models Explain Their Internal Mechanisms?

An interactive introduction to Patchscopes, an inspection framework for explaining the hidden representations of large language models, with large language models.

Searching for Unintended Biases With Saliency

Machine learning models sometimes learn from spurious correlations in training data. Trying to understand how models make predictions gives us a shot at spotting flawed models.

Measuring Fairness

There are multiple ways to measure accuracy. No matter how we build our model, accuracy across these measures will vary when applied to different groups of people.

Measuring Diversity

Search results that reflect historic inequities can amplify stereotypes and perpetuate under-representation. Carefully measuring diversity in data sets can help.

Can a Model Be Differentially Private and Fair?

Training models with differential privacy stops models from inadvertently leaking sensitive data, but there's an unexpected side-effect: reduced accuracy on underrepresented subgroups.

How Federated Learning Protects Privacy

Most machine learning models are trained by collecting vast amounts of data on a central server. Federated learning makes it possible to train models without any user's raw data leaving their device.

From Confidently Incorrect Models to Humble Ensembles

ML models sometimes make confidently incorrect predictions when they encounter out of distribution data. Ensembles of models can make better predictions by averaging away mistakes.

Datasets Have Worldviews

Every dataset communicates a different perspective. When you shift your perspective, your conclusions can shift, too.

Collecting Sensitive Information

The availability of giant datasets and faster computers is making it harder to collect and study private information without inadvertently violating people's privacy.

Do Machine Learning Models Memorize or Generalize?

An interactive introduction to grokking and mechanistic interpretability.

Why Some Models Leak Data

Machine learning models use large amounts of data, some of which can be sensitive. If they're not trained correctly, sometimes that data is inadvertently revealed.