magicalist
today at 9:54 PM
> Has any AI company ever addressed studies like [1] which found that models value certain groups vastly more than others?
Sure[1], on two fronts. You're basically asking a narrative-finishing device to finish a short story and hoping the completion reveals the device's underlying preference distribution, as opposed to the underlying distribution of completions of that particular short story.
> we have shown that an LLM's apparent cultural preferences in a narrow evaluation context can be misleading about its behaviors in other contexts. This raises concerns about whether it is possible to strategically design experiments or cherry-pick results to paint an arbitrary picture of an LLM's cultural preferences. In this section, we present a case study in evaluation manipulation by showing that using Likert scales with versus without a "neutral" option can produce very different results.
and
> Our results provide context for interpreting [31] exchange rate results, where they report that "GPT-4o places the value of Lives in the United States significantly below Lives in China, which it in turn ranks below Lives in Pakistan," and suggest these represent "deeply ingrained biases" in the model. However, when allowed to select a "neutral" option in comparisons, GPT-4o consistently indicates equal valuation of human lives regardless of nationality, suggesting a more nuanced interpretation of the model's apparent preferences. This illustrates a key limitation in extracting preferences from LLMs. Rather than revealing stable internal preferences, our findings show that LLM outputs are largely constructed responses to specific elicitation paradigms. Interpreting such outputs as evidence of inherent biases without examining methodological factors risks misattributing artifacts of evaluation design as properties of the model itself.
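To see the mechanism they're describing, here's a toy sketch (not either paper's actual code; the mock model and option labels are invented): a responder that genuinely treats two options as equal must still pick one when no "neutral" choice is offered, and any forced split can then be read back as a "preference".

```python
import random

def mock_model(options):
    # Stand-in for an LLM: prefers "equal" whenever it's offered,
    # otherwise breaks the tie arbitrarily.
    if "equal" in options:
        return "equal"
    return random.choice(options)

random.seed(0)

# Forced binary choice vs. the same question with a neutral option.
forced = [mock_model(["lives in A", "lives in B"]) for _ in range(1000)]
with_neutral = [mock_model(["lives in A", "lives in B", "equal"])
                for _ in range(1000)]

share_a_forced = forced.count("lives in A") / len(forced)
share_equal = with_neutral.count("equal") / len(with_neutral)

print(f"forced choice, share for A: {share_a_forced:.3f}")  # ~0.5, noise
print(f"neutral offered, share 'equal': {share_equal:.3f}")  # 1.0
```

The forced-choice split fluctuates around 0.5 by chance; with enough repeated templates, those fluctuations can aggregate into an "exchange rate" even for a responder with no preference at all.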
I also have a real problem with the paper. The methodology is super vague in a lot of places and in some cases non-existent, a fact brought up on OpenReview (and, maybe notably, they pushed the "exchange rate" section to an appendix I can't find in the version they ended up publishing[2] after review). They did publish their source code, which is great, but not their data, as far as I can tell, and it's not possible to tie specific figures back to the source code. For instance, if you look at the country-comparison phrasing in the code[3], the comparison list includes things like deaths and terminal illnesses in one country vs. the other, but also questions like an increase in wealth or happiness in one country vs. the other. Were all of those options used to determine the exchange rate, or just the ones that valued "lives", since that's what the pre-print's figure caption mentioned (and are "lives" measured in deaths, terminal illnesses, or both)? It would be easier to put weight on their results if they were both more precise and more transparent, as opposed to reading like a poster for a longer paper that doesn't appear to exist.
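To make the filtering question concrete, here's a hypothetical illustration (not the paper's pipeline; the template names and preference rates are invented) of why it matters which comparison templates feed the aggregate estimate:

```python
# Mock data: fraction of trials where a model preferred country A's
# outcome, per comparison template. All numbers are invented.
mock_results = {
    "N deaths prevented":         0.70,  # a "lives" outcome
    "N terminal illnesses cured": 0.65,  # a "lives" outcome
    "N people made wealthier":    0.45,  # not a "lives" outcome
    "N people made happier":      0.50,  # not a "lives" outcome
}
lives_templates = {"N deaths prevented", "N terminal illnesses cured"}

def mean_preference(results, templates):
    # Average preference for country A over the given subset of templates.
    vals = [results[t] for t in templates]
    return sum(vals) / len(vals)

all_rate = mean_preference(mock_results, mock_results)      # all templates
lives_rate = mean_preference(mock_results, lives_templates)  # "lives" only

print(f"all templates: {all_rate:.3f}")   # 0.575
print(f"lives only:    {lives_rate:.3f}")  # 0.675
```

Same model, same raw outputs, but the two subsets imply different apparent valuations, which is exactly why it matters that the published code doesn't say which subset produced the figure.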
[1] https://dl.acm.org/doi/pdf/10.1145/3715275.3732147
[2] https://neurips.cc/virtual/2025/loc/san-diego/poster/115263
[3] https://github.com/centerforaisafety/emergent-values/blob/ma...