Automated interpretability aims to translate large language model (LLM) features into human-understandable descriptions. However, these natural language feature descriptions are often imprecise, inconsistent, and require manual relabeling. In response, we introduce semantic regexes, structured language descriptions of LLM features. By combining primitives that capture linguistic and semantic feature patterns with modifiers for contextualization, composition, and quantification, semantic regexes produce precise and expressive feature descriptions. Across quantitative benchmarks and qualitative analyses, we find that semantic regexes match the accuracy of natural language while yielding more concise and consistent feature descriptions. Moreover, their inherent structure affords new kinds of analyses, including quantifying feature complexity across layers, scaling automated interpretability from insights into individual features to model-wide patterns. Finally, in user studies, we find that semantic regex descriptions help people build accurate mental models of LLM feature activations.
