Machine learning versus human learning: Basic units and form-meaning mapping
Stela Manova
November 2024

To properly understand how a Large Language Model (LLM) learns and processes language, it is essential not to impose linguistic logic on it but to start from the model’s architecture and, on that basis, to seek parallels with language learning by humans. This article introduces the first component of an LLM, the tokenizer, and compares the various tokenization steps with language structure, as established in linguistic research, and with language learning by humans, specifically with well-known facts from first language (L1) acquisition. Such an approach highlights unexpected similarities between machines and humans with respect to language. Given that an LLM operates with neither words nor semantics, both traditionally seen as core elements of human language and cognition, the focus is on basic units and form-meaning mapping. The latest version of the GPT tokenizer, o200k_base, serves as a primary source of data.
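
The abstract names the o200k_base tokenizer as its primary source of data. As a minimal illustrative sketch (the use of OpenAI's tiktoken package is an assumption here, not something the abstract specifies), the segmentation that this tokenizer produces can be inspected as follows; it shows that the basic units are byte-pair-encoded tokens rather than words:

    # Minimal sketch: inspecting o200k_base tokenization with tiktoken
    # (the library choice is an assumption; the abstract only names the tokenizer)
    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")

    text = "unbelievable"
    token_ids = enc.encode(text)                   # list of integer token ids
    pieces = [enc.decode([t]) for t in token_ids]  # surface string of each token

    print(token_ids)
    print(pieces)  # the segmentation need not align with morphemes such as un-believ-able

Running this on different words shows that the tokenizer's cuts sometimes coincide with morpheme boundaries and sometimes do not, which bears on the comparison between tokenization steps and language structure outlined above.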
Format: [ pdf ]
Reference: lingbuzz/008548
(please use that when you cite this article)
Published in: Submitted for inclusion in Vsevolod Kapatsinski and Gašper Beguš (eds.), Implications of Neural Networks and Other Learning Models for Linguistic Theory. Special issue of Linguistics Vanguard.
keywords: machine learning, human learning, natural language processing, large language models, chatgpt, tokenization, linguistic theory, first language acquisition, form-meaning mapping, words, tokens, syntax, phonology, semantics, morphology
previous versions: v2 [November 2024]
v1 [November 2024]
Downloaded: 1517 times

 
