1. Introduction
Augmentative and alternative communication (AAC) systems are tools used by people with complex communication needs (CCN) (e.g., people with Down’s syndrome, autism spectrum disorders, intellectual disability, cerebral palsy, developmental apraxia of speech, or aphasia) to compensate for the difficulties faced in their daily communication (Beukelman and Light Reference Beukelman and Light2013; American Speech-Language-Hearing Association n.d.). According to Beukelman and Light (Reference Beukelman and Light2013), approximately 97 million people worldwide may benefit from AAC. These people constitute a heterogeneous population regarding diagnosis, age, location, communication modality, and extent of AAC use (American Speech-Language-Hearing Association n.d.). They generally have limitations in gestural, oral, and written communication, which cause functional communication and socialization problems. AAC users include more than just people with CCN. They also include children at risk of delayed speech development, individuals who require AAC to supplement and clarify their speech or support comprehension (e.g., those with degenerative cognitive and linguistic disorders such as Alzheimer’s disease), and those with temporary conditions (Beukelman and Light Reference Beukelman and Light2013).
AAC tools are often categorized into low-tech (e.g., papercraft cards) and high-tech (e.g., speech-generating devices). Low-tech AAC systems like papercraft cards or picture exchange communication systems offer people with CCN a simple and tangible way to express themselves. These systems involve selecting images or objects representing words or concepts, allowing users to construct sentences and express their thoughts visually. They are particularly useful when power sources or sophisticated digital technology are not readily available or manageable. While these systems might not be as sophisticated as their high-tech counterparts, they can provide a foundation for language development and are often highly portable and easy to use. High-tech AAC systems, on the other hand, rely on more complex devices such as speech-generating devices, tablets with dedicated apps, or computer software that can facilitate communication. Such devices typically combine text, symbols, and/or voice output.
High-tech AAC systems help users to express feelings and opinions, develop understanding, reduce frustration in trying to communicate, and communicate preferences and choices (Beukelman and Light Reference Beukelman and Light2013). Such systems have been gaining ground in recent years. The advent of mobile devices such as the iPad, iPhone, and Android smartphones and tablets facilitated the release of low-cost systems (Lorah, Tincani, and Parnell Reference Lorah, Tincani and Parnell2018, Reference Lorah, Holyfield, Miller, Griffen and Lindbloom2022). By searching the Apple App Store and Google Play Store for “alternative communication,” one can find a variety of applications for AAC. Most apps promote communication using pictograms, similar to the one shown in Figure 1. Studies have demonstrated the positive effects of these devices for people with CCN (Holyfield and Lorah Reference Holyfield and Lorah2022; Hughes, Vento-Wilson, and Boyd Reference Hughes, Vento-Wilson and Boyd2022). Holyfield and Lorah (Reference Holyfield and Lorah2022) showed that using high-tech AAC is more pleasant for children with multiple disabilities compared to low-tech devices. They also suggest that using high-tech systems may be more efficient. These systems allow users to construct sentences by selecting communication cards (a.k.a. pictograms) from a grid and arranging them sequentially. Figure 1 presents an example of a high-tech AAC system with a content grid (large bottom rectangle) and a sentence area (small top rectangle), where cards are arranged in sequence.
Recent advancements have significantly enhanced the integration of AI into AAC systems. As Elsahar et al. (Reference Elsahar, Hu, Bouazza-Marouf, Kerr and Mansor2019) point out, incorporating AI into AAC systems can lead to increased accessibility to high-tech devices, faster output generation, and improved customization and adaptability of AAC interfaces. The potential benefits of AI in AAC systems are also highlighted by Sennott et al. (Reference Sennott, Akagi, Lee and Rhodes2019), who explicitly mention the application of Natural Language Processing (NLP) techniques for tasks such as word and message prediction, automated storytelling, voice recognition, and text expansion. The use of AI in AAC systems opens up possibilities for assisting in creating grammatically correct, semantically meaningful, and comprehensive messages. For instance, predictive models can be used to aid in message authoring (Pereira, Franco, and Fidalgo Reference Pereira, Franco and Fidalgo2020, Reference Pereira, Macêdo, Zanchettin, de Oliveira and do Nascimento Fidalgo2022b; Hervás et al. Reference Hervás, Bautista, Méndez, Galván and Gervás2020; Garcia, de Oliveira, and de Matos Reference Garcia, de Oliveira and de Matos2016; Dudy and Bedrick Reference Dudy and Bedrick2018; García et al. Reference García, Lleida, Castán, Marcos and Romero2015). These studies are driven by the need for AAC systems to facilitate the construction of meaningful and grammatically correct sentences (Franco et al. Reference Franco, Silva, Lima and Fidalgo2018). Moreover, predictive models in AAC can offer numerous benefits to users (Beukelman and Light Reference Beukelman and Light2013), such as: (1) reducing the number of selections needed to construct a sentence, thereby decreasing the communication effort; (2) providing spelling support for users who struggle with accurate spelling; (3) offering grammatical support; and (4) increasing the communication rate (words per minute).
In a recent survey, Pereira et al. (Reference Pereira, Medeiros, Zanchettin and Fidalgo2022a) listed eight studies proposing pictogram prediction methods in AAC. The survey’s results indicate that the methods used for prediction have changed over time, ranging from knowledge databases to statistical language models. Pereira et al. (Reference Pereira, Macêdo, Zanchettin, de Oliveira and do Nascimento Fidalgo2022b) demonstrated that fine-tuning Bidirectional Encoder Representations from Transformers (BERT) for pictogram prediction leads to better performance and generalization than n-gram language models and knowledge databases. However, adapting the proposed model to the needs of different users or user groups, or using it for languages other than English, remains problematic. The main difficulty is the lack of corpora to be used for training. Previous works used conversational natural language corpora adapted for AAC (Dudy and Bedrick Reference Dudy and Bedrick2018; Pereira et al. Reference Pereira, Macêdo, Zanchettin, de Oliveira and do Nascimento Fidalgo2022b).
This paper proposes using BERT for pictogram prediction in Brazilian Portuguese. It involves constructing and using an AAC corpus to fine-tune BERTimbau (Souza, Nogueira, and Lotufo Reference Souza, Nogueira and Lotufo2020), a Brazilian Portuguese version of BERT. For corpus construction, we first collect AAC-like sentences constructed by AAC practitioners; then, we use GPT-3 (Brown et al. Reference Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal, Neelakantan, Shyam, Sastry, Askell, Agarwal, Herbert-Voss, Krueger, Henighan, Child, Ramesh, Ziegler, Wu, Winter, Hesse, Chen, Sigler, Litwin, Gray, Chess, Clark, Berner, McCandlish, Radford, Sutskever and Amodei2020a) to generate similar synthetic sentences; finally, we convert the natural language sentences into pictogram-based sentences. For BERTimbau fine-tuning, we adapted the model by changing its vocabulary and embedding layer to handle the vocabulary present in the generated synthetic corpus. We tested the different approaches found in the literature on how to represent a pictogram for prediction: as a word (using pictogram captions), as a concept (using a dictionary definition), and as a set of synonyms (using related words). With these tests, we aim to answer the following question: What is the best way to represent a pictogram for prediction using a BERT-like model? In addition, considering that a pictogram is a visual support for communication in AAC systems, we assessed the usage of images for pictogram prediction. The goal is to answer the question: Can image representations increase the quality of pictogram prediction using a BERT-like model?
We evaluated the performance of model variations in terms of perplexity and top-$n$ accuracy. We use $n \in \{1,9,18,25,36\}$ to simulate the different grid sizes an AAC system can have. The results demonstrate that using embeddings computed from the pictograms’ captions, synonyms, or definitions yields similar performance. Using synonyms leads to lower perplexity, but using captions leads to the highest accuracies. Thus, choosing a method to implement in an AAC system is a design decision. A lower perplexity indicates that the model can generalize well to unseen data. However, using synonyms requires the preexistence of a database of synonyms. Using only captions can cause problems when the vocabulary has many pictograms for the same word. An alternative to solving this is using the pictogram definition, as in a dictionary. Previous studies demonstrated that a pictogram is better represented by a dictionary concept (Schwab et al. Reference Schwab, Trial, Vaschalde, Vial, Esperança-Rodier and Lecouteux2020; Pereira et al. Reference Pereira, Macêdo, Zanchettin, de Oliveira and do Nascimento Fidalgo2022b). However, the definition-based fine-tuning did not outperform the caption- and synonym-based versions in our experiments. Using images to compute embeddings requires more training data and time because the resulting vector space differs from that of the BERTimbau input embeddings. The code for these experiments is available online.Footnote a
The findings of this paper hold valuable implications for researchers, practitioners, and developers engaged in AAC systems, particularly those aiming to incorporate communication card prediction into their systems. The target audience for such systems typically comprises children with complex communication needs who face challenges in conventional writing or utilizing a traditional keyboard, such as QWERTY, for communication purposes. It is important to note that the intended users of these systems may or may not be literate. In the case of literate children, cognitive deficits may hinder their ability to effectively use written language, making AAC systems a supportive tool for communication. For non-literate children, AAC serves as an alternative resource, as it relies on a graphical system rather than conventional writing. By leveraging the insights and methodologies presented in this paper, researchers, practitioners, and developers can enhance the design and functionality of AAC systems, ultimately enabling effective and efficient communication for this target audience.
This paper is organized as follows: in Section 2, we present the theoretical information that is this work’s basis; in Section 3, we present the proposed method for fine-tuning BERTimbau and experimental details; in Section 4, we present our results; and, finally, in Section 5, we present the conclusions.
2. Background
2.1 Language modeling
A language model assigns probabilities to sequences of words (Jurafsky and Martin Reference Jurafsky and Martin2019). Consider the sentence “Brazil is a beautiful______” and ask what is the best word to complete it. Most people will choose words such as “country,” “place,” or “nation,” for they are the most probable among those that occur in natural language texts. This human decision is so natural that we do not think about how it happens. However, for language models, deciding which word to use to complete a sentence depends on the probabilities learned from a training corpus. For example, for an n-gram language model, the most probable word is the one that occurs most frequently after the word “beautiful” in the training corpus. The same model can also assign a probability to an entire sentence and predict that the sentence “Brazil is a beautiful country” has a higher probability of appearing in a text corpus than the same words in a different order (e.g., “is country beautiful Brazil a”).
An n-gram language model is the simplest model that assigns probabilities to sequences of words (Jurafsky and Martin Reference Jurafsky and Martin2019). The aim is to predict the next word based on the $n-1$ preceding words. The model uses relative frequency counts to estimate the probability of each word in a vocabulary $V$ being the next in the sequence $h$. Given a large text corpus, one counts the number of times the sequence $h$ is followed by the word $w \in V$. This way, in a bigram model ($n=2$), the probability of the word “country” completing the sequence “Brazil is a beautiful _____” can be simplified to:

$$P(\textit{country} \mid \textit{beautiful}) = \frac{C(\textit{beautiful country})}{C(\textit{beautiful})} \quad (1)$$
where $C$ is the function that counts the occurrence of words or sequences in the corpus. Since this is a bigram model, only the last preceding word is considered in the equation, which can be simplified to $P(country|beautiful)$ or $P(w_n|w_{n-1})$ .
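To make the relative-frequency estimate concrete, here is a minimal sketch in Python; the corpus and the words are toy examples, not data from the paper:

```python
from collections import Counter

def bigram_prob(tokens, prev_word, word):
    """Estimate P(word | prev_word) by relative frequency counts,
    i.e., C(prev_word word) / C(prev_word)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    if unigrams[prev_word] == 0:
        return 0.0
    return bigrams[(prev_word, word)] / unigrams[prev_word]

corpus = "Brazil is a beautiful country . Brazil is a beautiful place".split()
print(bigram_prob(corpus, "beautiful", "country"))  # 0.5 in this toy corpus
```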
The probability of an entire sequence can be estimated using the chain rule:

$$P(w_1, w_2, \ldots, w_n) = \prod_{k=1}^{n} P(w_k \mid w_1, \ldots, w_{k-1}) \quad (2)$$
The assumption that the probability of the next word depends only on the previous word is called the Markov assumption (Jurafsky and Martin Reference Jurafsky and Martin2019). Markov models assume that it is possible to predict the probability of a future unit (e.g., the next word) by looking only at the current state (e.g., the last preceding word). However, language is a continuous input stream highly affected by the writer/speaker’s creativity, vocabulary, language development level, etc. Suppose one asks two people to describe the same scene from a picture in a single sentence. In that case, they will likely construct sentences with similar semantics but using different words or ordering them differently. Moreover, in a written text, the occurrence of a specific word may depend not only on the $n-1$ preceding words but on the entire context, which can be the sentence, the paragraph, or all of the text. Still, n-gram models produce strong results for relatively small corpora and have been the dominant language model approach for decades (Goldberg and Hirst Reference Goldberg2017).
Among the language models that do not make the Markov assumption, we can highlight those based on recurrent neural networks (Elman Reference Elman1990) and the Transformers architecture (Vaswani et al. Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser and Polosukhin2017). Both may rely on word embeddings for feature extraction.
2.1.1 Word embeddings
Word embedding is a method to represent words using real-valued vectors that encode their meaning, assuming that words with similar meanings are closer to each other in the vector space (Jurafsky and Martin Reference Jurafsky and Martin2019). Mikolov et al. (Reference Mikolov, Chen, Corrado and Dean2013a) proposed the skip-gram model (a.k.a. word2vec), which learns high-quality vector representations of words from large amounts of text. The quality of the learned vectors allows similarity calculations between words and even operations such as $King - Man + Woman = Queen$, or $Madrid - Spain + France = Paris$. This means that by subtracting the vector of the word Man from the vector of the word King and summing it with the vector of the word Woman, the resulting vector is closer to the vector of the word Queen than any other vector (Mikolov, Yih, and Zweig Reference Mikolov, Chen, Corrado and Dean2013a, Reference Mikolov, Yih and Zweig2013c). These vectors can also capture synonymy well, for words with similar meanings tend to have similar vector representations.
The skip-gram model’s training objective is to find word vectors useful for predicting the surrounding words in a sequence or a document (Mikolov et al. Reference Mikolov, Sutskever, Chen, Corrado and Dean2013b). This way, the model is trained using a self-supervised approach, which avoids the need for any hand-labeled supervision signal. Given a sequence of words $w_1,w_2,\ldots,w_n$, the model attempts to maximize the average log probability calculated according to Equation (3), where $c$ is the size of the training context surrounding the center word $w_t$:

$$\frac{1}{n} \sum_{t=1}^{n} \sum_{-c \leq j \leq c,\, j \neq 0} \log P(w_{t+j} \mid w_t) \quad (3)$$

A large $c$ results in more training examples and can lead to high accuracy but may require more training time (Mikolov et al. Reference Mikolov, Sutskever, Chen, Corrado and Dean2013b). The basic skip-gram formulation defines $P(w_{t+j} \mid w_t)$ using the softmax function, as in Equation (4), where $v_w$ and $v'_{\!\!w}$ are the input and output vectors of $w$, and $W$ is the vocabulary size:

$$P(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\!\left({v'_{w}}^{\top} v_{w_I}\right)} \quad (4)$$

This formulation is impractical because the cost of computing the gradient of $\log P(w_O|w_I)$ is proportional to the vocabulary size, which can be large. Mikolov et al. (Reference Mikolov, Sutskever, Chen, Corrado and Dean2013b) suggest using the hierarchical softmax (Morin and Bengio Reference Morin and Bengio2005) as an efficient approximation of the full softmax. This way, the neural network behind skip-gram learns the best vector representation for each word in a vocabulary. The final model output is a dictionary with $\{word\;:\; vector\}$ pairs.
There is a set of other word embedding approaches with the same aim: to provide vector representations of words. We can classify skip-gram as a model that provides static embeddings, for the representation of a word will be the same regardless of the context in which it occurs. For example, the word bat has a different meaning in the sentences He can’t bat the ball and Batman dresses like a bat. However, in a static word embedding model, it has the same vector. The Transformers architecture (Vaswani et al. Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser and Polosukhin2017) overcomes this problem by adding context to the embeddings.
2.1.2 Transformers
The Transformers architecture, introduced by Vaswani et al. (Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser and Polosukhin2017), is a neural network model that operates solely on self-attention mechanisms to compute input and output representations. This approach allows for efficient and effective processing of sequential data in various natural language tasks. Self-attention allows a Transformer to extract and use information from arbitrarily large contexts without passing it through intermediate recurrent connections as in RNNs (Jurafsky and Martin Reference Jurafsky and Martin2019). A self-attention layer maps an input sequence to an output sequence of the same length. While processing an input item, the model can access all the inputs up to and including the one under consideration, but it has no access to information concerning inputs beyond the current one. Self-attention allows the model to relate different positions of a single sequence to compute a representation of each item in the sequence. By doing so, an attention-based approach compares an item of interest to a collection of other items to reveal their relevance in the context (or sequence) (Jurafsky and Martin, Reference Jurafsky and Martin2019). Given an input sequence, a Transformer produces an output distribution over the entire vocabulary for language modeling. The most famous language models based on the Transformers architecture are the GPT series (Radford et al. Reference Radford, Narasimhan, Salimans and Sutskever2018, Reference Radford, Wu, Child, Luan, Amodei and Sutskever2019; Brown et al. Reference Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal, Neelakantan, Shyam, Sastry, Askell, Agarwal, Herbert-Voss, Krueger, Henighan, Child, Ramesh, Ziegler, Wu, Winter, Hesse, Chen, Sigler, Litwin, Gray, Chess, Clark, Berner, McCandlish, Radford, Sutskever and Amodei2020a) and BERT (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019).
GPT (Radford et al., Reference Radford, Narasimhan, Salimans and Sutskever2018, Reference Radford, Wu, Child, Luan, Amodei and Sutskever2019; Brown et al. Reference Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal, Neelakantan, Shyam, Sastry, Askell, Agarwal, Herbert-Voss, Krueger, Henighan, Child, Ramesh, Ziegler, Wu, Winter, Hesse, Chen, Sigler, Litwin, Gray, Chess, Clark, Berner, McCandlish, Radford, Sutskever and Amodei2020a) is an auto-regressive generative language model standing for Generative Pre-trained Transformer. This model uses the Transformers architecture to learn word representations that transfer with little adaptation to a wide range of tasks (Radford et al. Reference Radford, Narasimhan, Salimans and Sutskever2018). The pre-training task is to predict the next word in a given sequence, through which the model learns word representations. These representations are then used in downstream tasks like sentiment analysis, machine translation, etc. The most recent version of the series is GPT-3 (Brown et al. Reference Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal, Neelakantan, Shyam, Sastry, Askell, Agarwal, Herbert-Voss, Krueger, Henighan, Child, Ramesh, Ziegler, Wu, Winter, Hesse, Chen, Sigler, Litwin, Gray, Chess, Clark, Berner, McCandlish, Radford, Sutskever and Amodei2020a), which demonstrated that language models are few-shot learners. This model and its rivals (e.g., Google’s PaLM model (Chowdhery et al. Reference Chowdhery, Narang, Devlin, Bosma, Mishra, Roberts, Barham, Chung, Sutton, Gehrmann, Schuh, Shi, Tsvyashchenko, Maynez, Rao, Barnes, Tay, Shazeer, Prabhakaran, Reif, Du, Hutchinson, Pope, Bradbury, Austin, Isard, Gur-Ari, Yin, Duke, Levskaya, Ghemawat, Dev, Michalewski, Garcia, Misra, Robinson, Fedus, Zhou, Ippolito, Luan, Lim, Zoph, Spiridonov, Sepassi, Dohan, Agrawal, Omernick, Dai, Pillai, Pellat, Lewkowycz, Moreira, Child, Polozov, Lee, Zhou, Wang, Saeta, Diaz, Firat, Catasta, Wei, Meier-Hellstern, Eck, Dean, Petrov and Fiedel2022) and DeepMind’s GOPHER (Rae et al. Reference Rae, Borgeaud, Cai, Millican, Hoffmann, Song, Aslanides, Henderson, Ring, Young, Rutherford, Hennigan, Menick, Cassirer, Powell, Driessche, Hendricks, Rauh, Huang, Glaese, Welbl, Dathathri, Huang, Uesato, Mellor, Higgins, Creswell, McAleese, Wu, Elsen, Jayakumar, Buchatskaya, Budden, Sutherland, Simonyan, Paganini, Sifre, Martens, Li, Kuncoro, Nematzadeh, Gribovskaya, Donato, Lazaridou, Mensch, Lespiau, Tsimpoukelli, Grigorev, Fritz, Sottiaux, Pajarskas, Pohlen, Gong, Toyama, d’Autume, Li, Terzi, Mikulik, Babuschkin, Clark, Casas, d., Guy, Jones, Bradbury, K. and Irving2021)) revolutionized most NLP-related tasks, as huge amounts of annotated data are no longer necessary for downstream tasks. GPT-3 was trained with 100 times more data than its predecessor GPT-2. The large amount of training data and the high number of parameters make GPT-3 powerful at performing tasks on which it was never explicitly trained, such as machine translation, math operations, and writing code.
BERT is a language representation model that stands for Bidirectional Encoder Representations from Transformers (Devlin et al., Reference Devlin, Chang, Lee and Toutanova2019). This model uses the attention mechanism (Vaswani et al., Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser and Polosukhin2017) to learn contextual relations between tokens (words or sub-words) in unlabeled texts by jointly conditioning on both left and right contexts in all model layers. Unlike directional models, which process the input in sequence (left-to-right or right-to-left), BERT processes the entire sequence simultaneously. This allows the model to learn a word’s context based on its entire neighborhood, both left and right. To do this, the model performs masked language modeling (MLM). During training, the data generator randomly chooses 15% of the token positions for prediction. For example, if the $i$-th token is chosen, it is replaced with (1) the $[MASK]$ token 80% of the time, (2) a random token 10% of the time, or (3) the unchanged $i$-th token 10% of the time. The model attempts to predict the $i$-th token based on the contextual information provided by the non-masked tokens, generating a contextualized representation for each token.
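A minimal sketch of these masking rules, selecting each position independently with probability 0.15 as common implementations do; the vocabulary and sentence are illustrative:

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mlm_prob=0.15):
    """Apply BERT's masking rules: each position is selected with
    probability mlm_prob; a selected token is replaced with [MASK] 80%
    of the time, a random vocabulary token 10%, and left unchanged 10%."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mlm_prob:
            labels[i] = tok              # loss is computed only at these positions
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_token   # 80%: [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(vocab)  # 10%: random token
            # else 10%: keep the original token
    return inputs, labels

inp, lab = mask_tokens("eu quero comer bolo".split(), vocab=["eu", "bolo", "casa"])
```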
2.1.3 Evaluating language models
One approach to assess the quality of a language model is to implement it in an application and evaluate its performance improvement, known as an extrinsic evaluation (Jurafsky and Martin, Reference Jurafsky and Martin2019). However, this requires creating a complete system that uses the $n$ models being evaluated, which can be both time-consuming and computationally expensive. For example, if two models for pictogram prediction were being compared, the models would need to be trained, two AAC boards using each model would need to be created, and a metric related to communication would need to be measured. This process can require a lot of human and computational resources, making it difficult or even impossible to complete. On the other hand, an intrinsic evaluation metric assesses the quality of a model without taking any application into account (Jurafsky and Martin, Reference Jurafsky and Martin2019).
Perplexity ($PP$ or $ppl$), an intrinsic evaluation metric, offers a quick and easy way to compare language models. It only requires a training and test dataset, making it a fast and low-resource experiment. Moreover, recent studies suggest that perplexity is correlated with the human judgment of sentences generated by language models (Shen et al. Reference Shen, Oualil, Greenberg, Singh and Klakow2017; Crook and Marin, Reference Crook and Marin2017; Adiwardana et al. Reference Adiwardana, Luong, So, Hall, Fiedel, Thoppilan, Yang, Kulshreshtha, Nemade, Lu and Le2020). The perplexity of a language model is a measure of how well it comprehends language. It is calculated as the inverse probability of the test set, normalized by the number of words in the test set (Jurafsky and Martin, Reference Jurafsky and Martin2019). A low perplexity indicates that the test set is not too surprising for the model, meaning it understands the language well. As an example, let’s say the test set is $W = w_1, w_2, \ldots, w_N$:

$$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} \quad (5)$$
The probability of $W$ can be expanded with the chain rule:

$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \ldots, w_{i-1})}} \quad (6)$$
where $P(w_i \mid w_1,\ldots,w_{i-1})$ is the probability of the $i$-th token, given the previous $i-1$ tokens (i.e., the context). Thus, for a bigram model:

$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}} \quad (7)$$
Notice that because of the inverse in Equation (6), the higher the conditional probability of the word sequence, the lower the perplexity.
We can also calculate perplexity by exponentiating the cross-entropy $H(W)$, which estimates the average number of bits required to encode each word of the sequence $W$:

$$PP(W) = 2^{H(W)} \quad (8)$$
BERT’s MLM objective does not directly compute perplexity, since the cross-entropy is only calculated for masked tokens. However, BERT can assign a probability to each word of a test sentence when no masked token is inputted into the model. We can then use this sentence probability to calculate the cross-entropy and the perplexity.
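A sketch of this pseudo-perplexity computation with the Hugging Face transformers library; the checkpoint id `neuralmind/bert-base-portuguese-cased` is our assumption for where BERTimbau is published, not something stated in the paper:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Assumed hub id for BERTimbau.
name = "neuralmind/bert-base-portuguese-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)
model.eval()

def pseudo_perplexity(sentence: str) -> float:
    """Feed the unmasked sentence as its own labels; the resulting
    cross-entropy over all tokens is exponentiated into a perplexity."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

print(pseudo_perplexity("O Brasil é um país bonito."))
```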
2.2 Pictogram prediction in AAC
AAC employs a variety of tools and techniques to aid the communication of individuals with CCN. In the context of high-tech AAC, pictographic images on communication cards serve as visual aids for the user, providing meaning to the words in their vocabulary. These pictographic systems benefit individuals who are illiterate due to age or disability, enabling communication for those with low cognitive levels or at very early developmental stages (Palao, Reference Palao2019). Numerous online databases offer a wealth of pictograms. However, for Brazilian Portuguese, there is no dataset as extensive as ARASAAC (Palao Reference Palao2019), making it the best alternative available; it provides access to over 30 thousand pictograms. Many of the available high-tech AAC systems arrange the pictograms in grids, as depicted in Figure 1. The organization of the vocabulary is tailored to the user’s needs and preferences. Some may opt to categorize the cards, while others may prefer multiple pages. Nevertheless, these systems must facilitate card selection for sentence construction (Franco et al., Reference Franco, Silva, Lima and Fidalgo2018).
Among the strategies to facilitate card selection in high-tech AAC systems, four stand out: (1) vocabulary organization—organizing the cards meaningfully to facilitate searching (e.g., taxonomic organization); (2) color coding systems—using a color coding system to label cards, such as the Fitzgerald Keys (Fitzgerald, Reference Fitzgerald1949; McDonald and Schultz, Reference McDonald and Schultz1973) or Colourful Semantics (Bryan Reference Bryan1997); (3) motor planning strategies—using consistent motor patterns so that cards can be found through motor memory (e.g., using the LAMP protocol (Halloran and Halloran, Reference Halloran and Halloran2006)); and (4) predictive models—predicting the next cards suitable to complete the sentence under construction. Predictive models can be used in addition to the other strategies to further refine the search for communication cards. Moreover, the benefits of using prediction techniques in AAC include (Beukelman and Light, Reference Beukelman and Light2013): (1) reducing the number of selections required to construct a sentence, thereby decreasing the effort for individuals; (2) providing spelling support for users who cannot accurately spell words; (3) providing grammatical support; and (4) increasing communication rate (i.e., words per minute).
Communication card prediction in AAC assumes a controlled vocabulary containing the cards used in the user’s daily communication. The language model assigns the probability of each vocabulary item being the next in an in-construction sentence. Recent studies used different models to perform this role. The most common are based on knowledge bases (Pereira et al. Reference Pereira, Medeiros, Zanchettin and Fidalgo2022a). Such models may allow using semantic scripts like the Colourful Semantics (Bryan Reference Bryan1997; Pereira, Pereira, and Fidalgo Reference Pereira, Pereira and Fidalgo2021) as support for sentence construction. However, they generally rely on complex construction pipelines, which require reprocessing for vocabulary or knowledge updates. Training a statistical language model might be an alternative. Some other proposals use n-gram (Garcia et al., Reference Garcia, de Oliveira and de Matos2016; Hervás et al. Reference Hervás, Bautista, Méndez, Galván and Gervás2020) or neural network (Dudy and Bedrick, Reference Dudy and Bedrick2018; Pereira et al. Reference Pereira, Macêdo, Zanchettin, de Oliveira and do Nascimento Fidalgo2022b) models. The literature suggests that neural network-based language models may perform better than n-gram models (Goldberg and Hirst, Reference Goldberg2017). However, they may require more computational resources for training and serving, making their deployment difficult in production.
Choosing a pictogram prediction model may involve practical questions like computational resources, deployment, etc. However, it also involves conceptual decisions. An example is the decision of what a pictogram is. Simply put, a pictogram is a graphic symbol representing an object or concept. It is usual to see pictograms on traffic signs, for example. In AAC, a pictogram is generally associated with a caption containing the word or expression it represents. The pictogram-caption pair forms the communication card, which the user selects and organizes to constitute a sentence. Some pictogram prediction approaches feed their models only with the captions (Garcia et al., Reference Garcia, de Oliveira and de Matos2016; Hervás et al. Reference Hervás, Bautista, Méndez, Galván and Gervás2020; Saturno et al. Reference Saturno, Ramirez, Conte, Farhat and Piucco2015), treating the task as a word prediction task. Other studies consider an AAC pictogram as a concept that links the graphical representation and the caption (Dudy and Bedrick, Reference Dudy and Bedrick2018; Pereira, Franco, and Fidalgo, Reference Pereira, Franco and Fidalgo2020; Martínez-Santiago et al. Reference Martínez-Santiago, Díaz-Galiano, Ureña-López and Mitkov2015). Schwab et al. (Reference Schwab, Trial, Vaschalde, Vial, Esperança-Rodier and Lecouteux2020) consider that a concept from a dictionary better represents a pictogram (e.g., person: a human being). They associated the ARASAAC pictograms with synsets (i.e., concepts) from the Princeton WordNET,Footnote b a lexical database for English.
For prediction, using concepts may be more meaningful because of polysemous words. For example, the English word “bat” can have many meanings (e.g., “nocturnal mouselike mammal” or “a club used for hitting a ball”) and, similarly, many related pictograms in a given vocabulary. How this is done varies among approaches. Dudy and Bedrick (Reference Dudy and Bedrick2018) grouped the words related to each pictogram and calculated embeddings to feed their LSTM model. Pereira et al. (Reference Pereira, Macêdo, Zanchettin, de Oliveira and do Nascimento Fidalgo2022b) associated each pictogram with a WordNET synset and used the vectors calculated by Scarlini et al. (Reference Scarlini, Pasini and Navigli2020) for each synset as inputs of their BERT-based model.
Although there are proposals for predicting pictograms in AAC, no pictogram-based corpus is available. Pereira et al. (Reference Pereira, Macêdo, Zanchettin, de Oliveira and do Nascimento Fidalgo2022b) and Dudy and Bedrick (Reference Dudy and Bedrick2018) used corpora in natural language adapted to the task. Pereira et al. (Reference Pereira, Macêdo, Zanchettin, de Oliveira and do Nascimento Fidalgo2022b) proposed SemCHILDES, which consists of part of the Child Language Data Exchange System (CHILDES) (MacWhinney, Reference MacWhinney2014) dataset annotated with word senses. The corpora in CHILDES contain transcribed conversations between children and parents, therapists, or teachers. The conversational nature and the target audience may make it suitable for pictogram prediction in AAC. However, following Pereira et al. (Reference Pereira, Macêdo, Zanchettin, de Oliveira and do Nascimento Fidalgo2022b)’s pipeline, the corpus requires some pre-processing steps. In comparison, Dudy and Bedrick (Reference Dudy and Bedrick2018) used an adapted version of SubtlexUS (Brysbaert and New, Reference Brysbaert and New2009), a corpus of subtitles from movies and television. The authors used the corpus as a proxy for AAC due to its spontaneous speech. Vertanen (Reference Vertanen2013) proposed a corpus of everyday conversational communications. The corpus has natural language sentences produced by workers from a crowdsourcing site. Thus, it is not strictly an AAC corpus.
3. Method
This section outlines the method employed for constructing our corpus. The approach builds upon the foundational work presented in Pereira et al. (Reference Pereira, Nogueira, Zanchettin and Fidalgo2023), where we initially detailed the corpus construction method for AAC systems. In this paper, we extend and refine this method, which consists of augmenting a set of sentences constructed by AAC practitioners. Figure 2 illustrates the method flow. The three main inputs are a controlled vocabulary, a pre-trained embedding matrix, and a pre-trained transformer. We detail these inputs in Section 3.1. The two main steps are (1) corpus construction (cf. Section 3.2) and (2) model fine-tuning (cf. Section 3.3). The main output of this method is the fine-tuned model, but we also consider the constructed corpus a relevant output.
3.1 Inputs
The three input resources for our method are (1) a pre-trained BERT, (2) a controlled vocabulary, and (3) a pre-trained embedding matrix. As an input model, we used BERTimbau (Souza et al., Reference Souza, Nogueira and Lotufo2020), a Brazilian Portuguese version of BERT. As a controlled vocabulary, we consider a list of communication cards, each consisting of (1) a pictogram (or picture) and (2) a caption with a word or a multi-word expression. It is common in the AAC field to have pre-defined vocabularies aimed at different contexts, activities, etc. An example is Project-Core,Footnote c which defines a list of 36 symbols as sufficient for universal core communication. Our experiments use the list of pictograms for Brazilian Portuguese from the ARASAAC dataset. There are 12,785 pictograms related to words and multi-word expressions (MWEs) (e.g., “café da manhã,” i.e., breakfast).Footnote d
As mentioned in Section 2.1.1, word embeddings are real-valued vectors used to represent words. In our experiments, we extract embeddings from four sources: (1) the pictogram caption (i.e., a word or expression); (2) the pictogram caption synonyms; (3) the pictogram glossary definition from ARASAAC; and (4) the pictogram image. For the caption embeddings, we use the input vectors from BERTimbau as a basis. Formally, given a vocabulary $V$ composed of words and MWEs $(w_1, \ldots, w_n)$ and the BERTimbau original embedding $B \in \mathbb{R}^{h \times D_b}$, where $h$ is the size of the hidden state and $D_b$ is the BERTimbau vocabulary size, we populate a new embedding matrix $P \in \mathbb{R}^{h \times D_v}$, where $D_v = |V|$, with the embeddings from $B$ for each token $t_i$ in $V$. For MWEs, the embeddings of each token are extracted from BERTimbau’s embeddings layer into a matrix $E \in \mathbb{R}^{h \times n}$, where $h$ is the dimensionality of the embedding (the same as the hidden state size), and $n$ is the number of tokens in the expression. We use the mean vector $\overline{E}$ as the expression’s embedding representation. We use an approach similar to Dudy and Bedrick (Reference Dudy and Bedrick2018) for caption synonyms. First, we search ARASAAC for the list of keywords for each pictogram. The pictogram representation is then the average of the embeddings of its keywords.
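A sketch of the caption-embedding extraction described above, again assuming the `neuralmind/bert-base-portuguese-cased` checkpoint for BERTimbau; note that transformers stores the embedding matrix as (vocabulary size × hidden size), i.e., transposed relative to the paper’s $B \in \mathbb{R}^{h \times D_b}$:

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "neuralmind/bert-base-portuguese-cased"  # assumed BERTimbau hub id
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

# One row per WordPiece token in the BERTimbau vocabulary.
B = model.get_input_embeddings().weight.detach()

def caption_embedding(caption: str) -> torch.Tensor:
    """Mean of the input embeddings of the caption's WordPiece tokens;
    for a single-token caption this is just its embedding row."""
    ids = tokenizer(caption, add_special_tokens=False)["input_ids"]
    return B[ids].mean(dim=0)

emb = caption_embedding("café da manhã")  # MWE example from the paper
```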
For generating embeddings from pictogram definitions, we applied two methods. Both methods use the definitions from ARASAAC concatenated with keywords. A pictogram in ARASAAC has a list of keywords, each with a definition. We concatenate this list as $keyword_1 \,\|\, definition_1 \,\|\, \ldots \,\|\, keyword_n \,\|\, definition_n$. The first extraction method considers the mean vector of the definition extracted from $B$ (i.e., the BERTimbau input embeddings). The second method uses the output of BERTimbau’s last encoder layer for the $[CLS]$ token.Footnote e We also computed representations from pictogram images using a Vision Transformer (ViT). We used a ViT model pre-trained on ImageNet-21k (14 million images, 21,843 classes) and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) (Dosovitskiy et al. Reference Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, Uszkoreit and Houlsby2020).Footnote f
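For the image-based representation, a sketch using a ViT from transformers; the hub id `google/vit-base-patch16-224` is our assumption for a model pre-trained on ImageNet-21k and fine-tuned on ImageNet 2012, and taking the final-layer [CLS] output as the embedding is one plausible reading of the setup:

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Assumed hub id matching the pre-training/fine-tuning described above.
name = "google/vit-base-patch16-224"
processor = ViTImageProcessor.from_pretrained(name)
vit = ViTModel.from_pretrained(name)

def image_embedding(path: str) -> torch.Tensor:
    """Encode a pictogram image; here we take the last layer's [CLS] output."""
    inputs = processor(images=Image.open(path).convert("RGB"),
                       return_tensors="pt")
    with torch.no_grad():
        out = vit(**inputs)
    return out.last_hidden_state[:, 0].squeeze(0)  # (hidden_size,)
```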
3.2 Corpus construction
This section presents the method for constructing our corpus. Our method consists of augmenting a set of sentences constructed by AAC practitioners. For this, we applied a four-step pipeline: (1) collection of sentences (cf. Section 3.2.1); (2) data augmentation (cf. Section 3.2.2); (3) data cleaning (cf. Section 3.2.3); and (4) text-to-pictogram transformation (cf. Section 3.2.4). Section 3.2.5 presents an analysis of the corpus’s main features.
3.2.1 Collection of sentences
For the collection of sentences, we invited speech therapists, psychologists, and parents of children with CCN to provide the sentences they consider the most commonly constructed in different contexts using high-tech AAC. We made an online questionnaire available and sent it to groups of people interested in AAC. In addition, we invited experts who had participated in other studies that we had previously conducted. Seventeen individuals participated in this study. Figure 3 presents a summary of the participants. Most have used AAC with their children or patients for more than six years; that is, they have vast experience in using such tools. Moreover, we had participants from at least six different fields, who can observe AAC usage from various points of view.
Each participant answered a questionnaire with six questions asking them to construct sentences. The first four questions required sentences about home, school, kitchen, and leisure contexts. The fifth question asked the participant to construct sentences that describe events free of context (e.g., I ate eggs at breakfast today). The last question asked them to construct sentences free of context that they consider essential for AAC. With this procedure, we collected a total of 667 unique sentences.
3.2.2 Data augmentation
The data augmentation step aims to generate sentences similar to those constructed by AAC practitioners, which we now refer to as human-composed. The generated sentences must be similar regarding the words used (vocabulary) and sentence structure (semantics and syntax). We used GPT-3 (Brown et al. Reference Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal, Neelakantan, Shyam, Sastry, Askell, Agarwal, Herbert-Voss, Krueger, Henighan, Child, Ramesh, Ziegler, Wu, Winter, Hesse, Chen, Sigler, Litwin, Gray, Chess, Clark, Berner, McCandlish, Radford, Sutskever and Amodei2020b)Footnote g with a few-shot learning approach. We provide some examples to GPT-3 in the form of text prompts and ask it to produce new similar examples by completing our prompts. We used two approaches: (1) using the human-composed sentences as examples, and (2) using a controlled vocabulary as a basis. We explain each approach in more detail below.
We used the human-composed sentences as few-shot examples in the GPT-3 prompt. We shuffled the human-composed sentences to induce variability in the generated sentences regarding participant style. Then, we divided the sentences into groups of 10 and used them as examples in the GPT-3 prompt, as shown in Figure 4. This prompt is inputted into the model, which produces a completion following the same structure as the examples. In Figure 4, we present the prompt used for sentence generation (a) and an English version (b) to facilitate reader understanding. With this prompt, we generated 2,772 sentences, which are available for download.Footnote h
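A sketch of this few-shot generation against the legacy OpenAI completions API that was contemporary with GPT-3; the engine name, prompt wording, and sampling parameters are our assumptions (the actual prompt is the one shown in Figure 4):

```python
import openai  # legacy completions API contemporary with GPT-3

openai.api_key = "YOUR_API_KEY"  # placeholder

# Ten human-composed sentences per prompt (two shown here).
examples = ["eu quero comer bolo", "eu vou para a escola amanhã"]
prompt = "Frases:\n" + "\n".join(f"- {s}" for s in examples) + "\n-"

response = openai.Completion.create(
    engine="text-davinci-002",  # assumed GPT-3 engine name
    prompt=prompt,
    max_tokens=256,
    temperature=0.8,            # sampling randomness for variability
)
print(response.choices[0].text)
```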
We used the words related to the pictogram in the Brazilian Portuguese subset of ARASAAC as a basis for generating new sentences through GPT-3. This vocabulary consists of 12,785 pictograms with words and expressions (e.g., “good morning”). Each pictogram has a list of keywords. In total, there are 11,806 unique terms, including words and MWEs. We shuffled the vocabulary items and divided them into groups of 20. We randomly selected five words (or expressions) from each group and used them to search for example sentences on our already collected corpus. We sampled from three to six example sentences for each group and used them as few-shot examples on the GPT-3 prompt, as shown in Figure 5.
3.2.3 Data cleaning
An automatically generated corpus like the one we produced can have misleading sentences. Therefore, we performed a data cleaning step, which consisted of (a) removing sentences with offensive content using the method proposed by Leite et al. (Reference Leite, Silva, Bontcheva and Scarton2020); (b) removing sentences with high perplexities according to BERTimbau (Souza et al. Reference Souza, Nogueira and Lotufo2020), selecting the quartile with the highest perplexities for removal; and (c) removing sentences with fewer than three or more than 11 tokens. As mentioned in Section 2.1.3, BERT-like models do not directly compute perplexity, for the cross-entropy is calculated only for masked tokens. However, if no masked token is inputted into the model and a copy of the input sentence is used as labels, the model can assign a probability to each word in the sentence. We can then use this sentence probability to calculate the cross-entropy and the perplexity.
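A sketch of filters (b) and (c) with pandas, using toy data; we read the quartile criterion as dropping the quarter of sentences with the highest pseudo-perplexity (cf. the sketch in Section 2.1.3):

```python
import pandas as pd

# Toy stand-ins; in practice `sentence` holds the generated sentences
# and `ppl` their pseudo-perplexities.
df = pd.DataFrame({
    "sentence": ["eu quero comer bolo", "xyz abc",
                 "eu vou para a escola amanhã"],
    "ppl": [12.3, 410.7, 25.1],
})

# (c) keep sentences with 3 to 11 tokens
n_tokens = df["sentence"].str.split().str.len()
df = df[(n_tokens >= 3) & (n_tokens <= 11)]

# (b) drop the quartile with the highest pseudo-perplexity
df = df[df["ppl"] <= df["ppl"].quantile(0.75)]
```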
3.2.4 Text to pictogram
This section details how we transformed natural language sentences into pictogram sequences. We used the Brazilian Portuguese set of pictograms from the ARASAAC database. As mentioned before, each pictogram has a list of keywords, and each keyword has a glossary definition. However, a single term can be used for multiple pictograms. For example, the word “banco” (i.e., bank) has at least three pictograms. Thus, this is a word sense disambiguation problem. We solve it using BERTimbau (Souza et al., Reference Souza, Nogueira and Lotufo2020) to encode the target sentence and the pictogram definitions, and the K-Nearest Neighbors algorithm to choose the most relevant pictogram for each word in a given sentence.
For example, given the sentence “ele quer fazer xixi” (he wants to pee), the first step is to tokenize it. We use all the keywords in ARASAAC as our vocabulary. It includes MWEs like “fazer xixi” (pee) or “café da manhã” (breakfast). For handling such expressions, we used an MWE tokenizerFootnote i in such a way that the tokenized version of the example sentence is $S_t = \{ele,\ querer,\ fazer\ xixi\}$. We also lemmatize the sentence, for the pictograms in ARASAAC have lemmas as keywords. We search the ARASAAC database for matching pictograms for each token in the sentence. When more than one pictogram is found, disambiguation is necessary. We concatenate the pictogram definitions and encode them using BERTimbau. We consider the sum of the hidden states of the last four encoder layers for the token [CLS] as the pictogram representation, in an approach similar to Scarlini et al. (Reference Scarlini, Pasini and Navigli2020). For the target token, we consider as its representation the vector at the token’s position in the encoded target sentence. In the case of MWEs, we consider the mean representation of the tokens in the expression. The final step is to get the pictogram representation most similar to the target representation using the KNN algorithm. Figure 6 presents the pictogram version of the example sentence. For some words, there is no equivalent pictogram in ARASAAC. Still, we keep the word in the sentence, considering that one can use a customized picture to represent it or a pictogram from another dataset.
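A sketch of the disambiguation step: encode each candidate pictogram’s concatenated definitions as the sum of the last four encoder layers’ [CLS] states, then pick the 1-nearest neighbor of the target token’s contextual vector. The checkpoint id and helper names are our assumptions:

```python
import numpy as np
import torch
from sklearn.neighbors import NearestNeighbors
from transformers import AutoModel, AutoTokenizer

name = "neuralmind/bert-base-portuguese-cased"  # assumed BERTimbau hub id
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def cls_sum_last4(text: str) -> np.ndarray:
    """Sum of the last four encoder layers' hidden states at [CLS]."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hs = model(**enc, output_hidden_states=True).hidden_states
    return torch.stack(hs[-4:]).sum(dim=0)[0, 0].numpy()

def disambiguate(target_vec: np.ndarray, definitions, picto_ids):
    """Return the id of the candidate whose concatenated definitions
    are nearest (1-NN) to the target token's contextual vector."""
    vecs = np.stack([cls_sum_last4(d) for d in definitions])
    knn = NearestNeighbors(n_neighbors=1).fit(vecs)
    _, idx = knn.kneighbors(target_vec.reshape(1, -1))
    return picto_ids[int(idx[0, 0])]
```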
3.2.5 The constructed corpus
Table 1 summarizes the constructed corpus. The corpus consists of a set of 13,796 sentences with the following characteristics: (1) they are in direct order (i.e., subject+verb+complements); (2) they are examples of phrases spoken in a conversation; (3) they have a simple vocabulary; and (4) they are common in the AAC context.
Figure 7 presents a chart that displays the frequency of words in the corpus, with a separate section for stop words, sorted by frequency. The chart provides an overview of the most common terms used in the corpus and can help identify patterns or trends in the language used. Notably, the most frequent word (excluding stopwords) in the corpus is “quero” (i.e., “I want”), suggesting a prominent focus on expressing wants or desires within the dataset. This aligns with the common usage of AAC systems, where users often communicate their needs and preferences. The high frequency of the word “quero” signifies a recurring theme of expressing intentions and personal desires in the sentences constructed by AAC practitioners and generated by GPT-3. The chart also displays the frequency of stop words, which are words that are not semantically meaningful, such as “o,” “a,” “de,” etc. Stop words in high frequency indicate that the corpus contains common, everyday language rather than specialized or technical language. Overall, the chart in Figure 7 can be a useful tool for analyzing the language used in a corpus and gaining insight into the topics and themes it covers.
The chart in Figure 8 displays the frequency of word combinations, specifically bigrams and trigrams, in the corpus. Bigrams are combinations of two words (e.g., “I am”), and trigrams are combinations of three words (e.g., “I am going”). The chart is sorted by frequency, with the most frequent bigrams and trigrams appearing at the top. This type of analysis is useful for identifying common phrases and idiomatic expressions used in the corpus and understanding the relationship between words in the language. Additionally, it can provide insight into the style and tone of the text, such as whether it is formal or informal. Overall, the chart in Figure 8 can be a valuable tool for understanding the language used in the corpus at a deeper level. For example, the most frequent bigram is “eu quero” (I want), indicating that the corpus might be focused on expressing wants or desires. Additionally, it can be used to identify patterns in the language, such as specific conjunctions or prepositions, which can further inform the analysis of the corpus.
Figure 9 presents the word and word combination (bigrams and trigrams) frequency distributions for the human-composed corpus. This figure provides a valuable basis for comparing the distribution of the generated corpus with the human-composed one. Upon analyzing the chart and comparing it to the frequency distributions shown in Figures 7 and 8, it becomes evident that the human-composed and generated corpora exhibit similar patterns. Specifically, the frequency distribution of the most common words and stop words in the human-composed corpus aligns closely with their presence in the generated corpus. This similarity reinforces the effectiveness of using GPT-3 to generate synthetic sentences that resemble those composed by AAC practitioners. It indicates that the generated corpus captures essential linguistic patterns present in real-world AAC communication. The presence of similar word combinations (bigrams and trigrams) further corroborates the compatibility between the human-composed and generated corpora, strengthening the case for the effectiveness of the proposed methodology in creating a synthetic AAC dataset that mirrors the language patterns observed in real-life AAC interactions.
In addition to evaluating the quality and representativeness of the automatically generated sentences, we conducted a coverage assessment for the constructed corpus. The coverage measures the fraction of sentences generated through text augmentation that are assigned to the same cluster as at least one human-composed sentence. To quantify the coverage, we adopted a clustering-based approach, generating sentence embeddings for both the human-composed and augmented corpora. The k-means clustering algorithm was utilized to group the sentence embeddings into distinct clusters. For generating the sentence embeddings, we employed BERTimbau (Souza et al. Reference Souza, Nogueira and Lotufo2020), using the average of the output vectors from the last four encoder layers for the $[CLS]$ token. This methodology allowed us to effectively assess the degree of overlap between the human-composed and automatically generated sentences, shedding light on the corpus’s overall coverage and ability to capture essential linguistic patterns.
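A sketch of the clustering-based coverage measure, assuming the sentence embeddings have already been computed as described above (the embeddings below are random stand-ins):

```python
import numpy as np
from sklearn.cluster import KMeans

def coverage(generated_emb: np.ndarray, human_emb: np.ndarray,
             n_clusters: int) -> float:
    """Fraction of generated sentences that land in a cluster containing
    at least one human-composed sentence."""
    X = np.vstack([generated_emb, human_emb])
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(X)
    gen_labels = labels[: len(generated_emb)]
    human_clusters = set(labels[len(generated_emb):])
    return float(np.mean([lbl in human_clusters for lbl in gen_labels]))

gen, hum = np.random.rand(200, 768), np.random.rand(50, 768)  # toy data
print(coverage(gen, hum, n_clusters=10))
```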
To evaluate the coverage of the generated corpus, we collected an additional 203 sentences from AAC specialists. This set is referred to as the test set of the human-composed corpus. The original 667 sentences collected from the specialists constitute the training set of the human-composed corpus. The test set provides a means of measuring the quality and reliability of the generated corpus by comparing its content with the human-composed sentences.
The line chart in Figure 10 depicts the coverage ratio of three different scenarios: the blue line represents the coverage ratio of the automatically generated corpus over the test set of the human-composed corpus. The orange line represents the automatically generated corpus coverage ratio over the human-composed corpus training set. Finally, the green line represents the coverage ratio of the test set of the human-composed corpus over the training set.
As the number of clusters increases from 10 to 200, we can observe that the blue line (coverage of the automatically generated corpus over the test set of the human-composed corpus) decreases more steeply than the other two lines. This can be explained by the fact that the human-composed corpus is smaller than the generated one, leading to a decrease in coverage as the number of clusters increases. However, it is important to note that both the orange and green lines remain relatively stable throughout the range of the number of clusters, showing that the coverage of the auto-generated corpus over the training set, and of the test set of the human-composed corpus over the training set, respectively, is not significantly affected by the number of clusters.
The results demonstrate that the generated corpus is semantically similar to the original human-composed corpus, with a coverage ratio of up to 0.7 for the training set of the human-composed corpus, even when a large number of clusters is used. The coverage ratio is slightly lower but still significant for the test set of the human-composed corpus, remaining at approximately 0.5 or higher when fewer than 130 clusters are used.
3.3 Fine-tuning
To fine-tune BERTimbau for pictogram prediction, we first have to change the model vocabulary and the input embeddings layer. BERT and BERTimbau use a vocabulary based on WordPiece (Wu et al. Reference Wu, Schuster, Chen, Le, Norouzi, Macherey, Krikun, Cao, Gao, Macherey, Klingner, Shah, Johnson, Liu, Kaiser, Gouws, Kato, Kudo, Kazawa, Stevens, Kurian, Patil, Wang, Young, Smith, Riesa, Rudnick, Vinyals, Corrado, Hughes and Dean2016), which divides words into a limited set of common sub-word units (e.g., “Playing” into “Play#” and “#ing”). This vocabulary does not apply to pictogram prediction, for the tokens in pictogram-based sentences must be unique identifiers that cannot be divided into sub-items. For example, the sentence in Figure 6 is represented as “6481 31141 16713.” Our vocabulary consists of identifiers for ARASAAC pictograms. Thus, we use a word-level tokenizer, which splits a sentence into words by white space.
Changing the vocabulary also requires changing the embeddings layer. Intuitively, we tell the model that we changed the vocabulary to a new language, and the new embedding vectors represent the words in this new language (Pereira et al. Reference Pereira, Macêdo, Zanchettin, de Oliveira and do Nascimento Fidalgo2022b). As mentioned in Section 3.1, we use different approaches for pictogram embeddings.
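A sketch of the vocabulary/embedding swap in transformers, with a random placeholder matrix standing in for the pictogram embeddings of Section 3.1; note that resizing also resizes the MLM output head, which BERT ties to the input embeddings:

```python
import torch
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("neuralmind/bert-base-portuguese-cased")

# Placeholder: in practice P holds the pictogram embeddings of Section 3.1,
# one row per id in the new word-level pictogram vocabulary.
P = torch.randn(12785, model.config.hidden_size)

model.resize_token_embeddings(P.shape[0])  # new vocabulary size
with torch.no_grad():
    model.get_input_embeddings().weight.copy_(P)
```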
We use the corpus constructed with the method presented in Section 3.2 as the training data. The corpus has a total of 13,796 sentences, which we randomly divide in a proportion of 68/16/16 into training, test, and validation sets. We fine-tune with a batch size of 768 sequences of 13 tokens each (768 * 13 = 9,984 tokens/batch). Each data batch was collated to choose 15% of the tokens for prediction, following the same rules as BERT: if the $i$-th token is chosen, it is replaced with (1) the $[MASK]$ token 80% of the time, (2) a random token 10% of the time, or (3) the unchanged $i$-th token 10% of the time. We use the same optimizer as BERT (Devlin et al., Reference Devlin, Chang, Lee and Toutanova2019): Adam, with a learning rate of $1 \times 10^{-5}$ for all model versions, with $\beta _1 = 0.9$, $\beta _2 = 0.999$, L2 weight decay of 0.01, and linear decay of the learning rate. Fine-tuning was performed on a single 16 GB NVIDIA Tesla V100 GPU for 200 epochs for the caption and synonym versions and 500 epochs for the other versions. The definition- and image-based versions require more training time because their input vectors come from a different vector space than the original BERTimbau embeddings.
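A sketch of this fine-tuning setup with the transformers Trainer; `tokenizer`, `train_ds`, and `val_ds` stand for the word-level pictogram tokenizer and the 68/16/16 dataset splits described above, `model` is the adapted BERTimbau, and DataCollatorForLanguageModeling applies the same 15% selection with the 80/10/10 replacement rules:

```python
from transformers import (DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
args = TrainingArguments(
    output_dir="picto-bertimbau",
    per_device_train_batch_size=768,
    num_train_epochs=200,          # 500 for the definition/image versions
    learning_rate=1e-5,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.999,
    lr_scheduler_type="linear",
)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=train_ds, eval_dataset=val_ds).train()
```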
4. Results and analysis
Table 2 presents the results obtained by testing each version of the proposed model regarding perplexity (PPL) and top-$n$ accuracies. We use $n \in \{1,9,18,25,36\}$ to simulate the different grid sizes an AAC system can have. We calculate perplexity by exponentiating the cross-entropy over the test set’s entire sentences without masked tokens (cf. Section 2.1.3). For perplexity, lower is better. The table shows that the model in which the embeddings were calculated using the pictogram caption synonyms has the lowest perplexity. This means that this model better understands how the language present in the test set works: intuitively, the new data was less surprising to it than to the other model versions. Thus, it may generalize better in different scenarios. The model with embeddings extracted only from the pictograms’ captions had better accuracy. However, the difference between these two models in all metrics is not significant enough to indicate which is the best.
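A minimal sketch of the top-$n$ accuracy computation for one masked position, with $n$ matching the simulated grid sizes; the vocabulary size and pictogram id below are illustrative:

```python
import torch

def top_n_accuracy(logits: torch.Tensor, target_id: int,
                   ns=(1, 9, 18, 25, 36)):
    """For one masked position, check whether the target pictogram id is
    among the n highest-scoring predictions (one n per grid size)."""
    ranked = torch.argsort(logits, descending=True)
    return {n: bool((ranked[:n] == target_id).any()) for n in ns}

hits = top_n_accuracy(torch.randn(12785), target_id=6481)  # toy logits
```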
Regarding the models in which the pictogram definitions were used to compute embeddings, using the mean vector of the definition extracted from the BERTimbau input embeddings proved more effective. Using the BERTimbau outputs as the definition representation did not yield good results, with higher perplexities and lower accuracies. Fine-tuning BERTimbau with these embeddings may require more training data and time, since the vectors come from a vector space different from the model’s original one. The same holds for the models using embeddings computed from pictogram images and their combinations. Based on these models’ training and validation loss curves, there is still room for improvement, as the losses keep falling even after 500 epochs.
Therefore, based on the metrics presented in Table 2, the best way to represent a pictogram in the proposed method is using the pictogram caption or its synonyms. The decision of which of these two approaches to use depends on the vocabulary characteristics. For example, it is impossible to use synonyms if no synonym dataset is available. Conversely, if the same caption is used for two different pictograms in a vocabulary, it may be difficult for the model to disambiguate them. Using the pictogram concept, as in Pereira et al. (Reference Pereira, Macêdo, Zanchettin, de Oliveira and do Nascimento Fidalgo2022b), can solve these problems. Nevertheless, to our knowledge, there is no well-established lexical database for Brazilian Portuguese comparable to what the Princeton WordNet is for English. Using the pictogram definition can be an alternative, but the results demonstrate that it performs worse than using only captions or synonyms. In addition, encoding pictograms based on their definition may require more time and resources than using captions or related words.
Figure 11 presents four sentences from the test dataset and the top-6 predictions performed by the model trained using embeddings from captions. The examples show the model’s behavior in different scenarios. Figure 11(a) presents an example of a subject+verb+complement sentence. The sentence represented is equivalent to “you want a ______.” or “do you want a ______?” in English. Thus, it may be either a statement or a question under construction. The top-6 pictograms suggested as completions demonstrate that the model’s prediction is affected by the token um (or a), which is an indefinite article. Figure 11(b) presents an example using an auxiliary verb (i.e., ir, to go). In this case, the model predicts pictograms that can act as the sentence’s main verb. Figure 11(c) shows an example of descriptor prediction. Notice that there are two pictograms for the word novo (i.e., new); this occurs because they have the same caption. Figure 11(d) presents an example of predicting the second pictogram of a sentence that begins with a verb. In this case, the verb estar can mean I am (e.g., I am tired) or it is (e.g., it is hot).
4.1 Usage guidelines: how can others use this work?
Researchers, developers, and practitioners interested in using the proposed method and the findings presented in this work can follow the guidelines outlined below to enhance pictogram prediction in AAC systems, treating it as a low-resource domain:
- Constructing a synthetic AAC corpus: Researchers can extend the method for constructing the synthetic AAC corpus to create their own corpus. This approach can be applied to different languages or specific target populations. By following the methodology described in Section 3.2, researchers can adapt the process and gather data relevant to their specific context and objectives. It is worth mentioning that the generated corpus depends on the input sentences and vocabulary. Furthermore, it is possible to induce the model to generate sentences for a specific context, user, or group of users, allowing a more tailored synthetic AAC corpus that caters to specific requirements or preferences.
- Fine-tuning a language model: The constructed synthetic AAC corpus can be used for fine-tuning transformer-based language models such as BERT. Researchers can combine the corpus with the methodology presented in Section 3.3 to adapt the language model for pictogram prediction. This process allows for personalized message authoring in AAC systems, enhancing the system’s relevance and accuracy in generating suggestions.
- How to represent a pictogram: Researchers or AAC developers can use our experiments as a basis to decide how to represent a pictogram when using a transformer-based model. Our experiments have shown that several approaches to representing pictograms yield similar results. One way is to use the captions associated with the pictograms, treating the prediction task as a word prediction task. However, it is important to consider that, in some vocabularies, different pictograms can share the same caption and therefore receive the same vector representation. Another approach is to use synonyms or definitions, but this requires access to an external database that may not always be available. These findings may be helpful for developers and researchers looking to work with pictogram prediction.
- Developing AAC systems with pictogram prediction: Developers can leverage the proposed method to design AAC systems that perform pictogram prediction based on the user’s vocabulary. To implement this, developers can use the method we propose to create a corpus, modify a transformer-based model, and train it accordingly; a minimal inference sketch is shown after this list. By incorporating pictogram prediction, AAC systems can enhance the user experience, providing real-time suggestions that facilitate efficient and effective communication.
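To make the last guideline concrete, the following is a minimal inference sketch under our own assumptions (the model and tokenizer are the fine-tuned ones from Section 3.3; function and variable names are illustrative). It appends a $[MASK]$ token to the pictogram sequence under construction and returns the top-$n$ candidates for the next position, where $n$ matches the grid size:

# Sketch: top-n next-pictogram suggestions for a sentence under construction.
import torch

def suggest_next(model, tokenizer, picto_ids, n=9):  # n=9 for a 3x3 grid
    text = f"[CLS] {picto_ids} [MASK] [SEP]"  # e.g., picto_ids = "6481 31141"
    ids = torch.tensor([tokenizer.encode(text).ids])
    mask_pos = (ids[0] == tokenizer.token_to_id("[MASK]")).nonzero()[0].item()
    with torch.no_grad():
        logits = model(input_ids=ids).logits
    top = logits[0, mask_pos].topk(n).indices.tolist()
    return [tokenizer.id_to_token(i) for i in top]  # candidate pictogram IDs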
4.2 Limitations
The limitations of this study primarily stem from the fact that the proposed models were not evaluated in a real-world AAC system by actual users, either with or without CCN. This is a significant limitation, as the effectiveness and efficiency of AAC solutions are best evaluated in a practical setting, where users interact with the system in their daily communication. However, developing a fully functional AAC system that incorporates the models proposed in this paper is beyond the scope of this study. We focused on developing and evaluating the models in a controlled environment, which may not fully reflect the complexities and challenges of real-world AAC usage.
Another limitation of this study is the lack of diversity in the AAC corpus used for training the model. The corpus was constructed using sentences generated by AAC practitioners and synthetic sentences generated by GPT-3, which may not fully represent the diverse communication needs and styles of AAC users. Nevertheless, it is important to note that constructing the corpus is a crucial step in our methodology that can be replicated for various scenarios. The output of corpus generation depends on the input sentences and vocabulary; a more diverse set of input sentences may lead to a more varied corpus. However, we should also consider the costs associated with corpus generation, which can limit the quantity of generated sentences and ultimately affect the corpus’s diversity. Additionally, this study focused on evaluating the models’ performance in Brazilian Portuguese and did not explore their applicability to other languages. The effectiveness of the models in different languages may vary due to language-specific characteristics and variations in vocabulary usage. Further research is needed to adapt and evaluate the models for other languages.
Furthermore, we also recognize as a limitation the fact that the model we used to generate the sentences was not specifically trained for Brazilian Portuguese. This could affect the generated sentences’ accuracy and relevance, as the model may not fully capture the nuances and complexities of the Brazilian Portuguese language. Future studies could consider using a model trained specifically for Brazilian Portuguese, such as Sabiá (Pires et al. Reference Pires, Abonizio, Almeida and Nogueira2023).
Finally, the study did not consider the potential influence of user-specific factors on the models’ performance, such as age, cognitive abilities, or familiarity with AAC systems. These factors can significantly impact the usability and effectiveness of AAC systems. Future research should explore these factors to optimize the models for individual users and address their unique communication needs.
Despite these limitations, this study provides valuable insights into using BERT-like models for pictogram prediction in AAC systems and lays the groundwork for future research.
5. Conclusions
Recent studies propose methods for pictogram prediction in AAC systems as an alternative to support the construction of meaningful and grammatically adequate sentences. The existing methods vary regarding the technique used for prediction and how a pictogram is represented. In AAC, a pictogram (a.k.a. communication card) combines a pictograph and a caption representing a concept (e.g., an action, person, object, or location). Some studies consider that the word or expression in the caption is enough to perform prediction, while others prefer to represent the pictogram as a set of synonyms or as a concept from a dictionary.
In this paper, we investigate the most appropriate known manner to represent pictograms for pictogram prediction in Brazilian Portuguese. To do this, we propose a method for fine-tuning a BERT-like model for pictogram prediction from scratch that might be suitable for languages other than Portuguese. First, we constructed an AAC corpus for Portuguese by collecting sentences from specialists and augmenting the data using a large language model. Then, we fine-tuned BERTimbau, a Portuguese version of BERT. We conducted experiments using four different ways of representing pictograms: (1) using the captions (i.e., words or expressions), (2) using the caption synonyms, (3) using the pictogram definition, and (4) using the pictogram image to compute embeddings. We evaluated the performance of the models in terms of perplexity and top-$n$ accuracy. The results demonstrate that embeddings computed from the pictograms’ captions, synonyms, or definitions perform similarly. Using synonyms leads to lower perplexity, but using captions leads to the highest accuracies. This suggests that choosing a method to implement in an AAC system is a design decision. Additionally, we found that using image representations did not improve the quality of pictogram prediction.
We recognize using a synthetic corpus as a limitation of this study. Although the corpus was constructed using human-composed sentences as a basis, the resulting sentences can suffer the influence of the GPT-3 training biases (Dale Reference Dale2021). To reduce this impact, we removed the sentences with offensive content. In addition, GPT-3 can generate incoherent sentences with confusing semantics, as it was not trained specifically for Portuguese. This can also affect the diversity of the produced sentences, as the model likely saw less Portuguese text than English during training. Initiatives such as the Open Pre-trained Transformers (Zhang et al. Reference Zhang, Roller, Goyal, Artetxe, Chen, Chen, Dewan, Diab, Li, Lin, Mihaylov, Ott, Shleifer, Shuster, Simig, Koura, Sridhar, Wang and Zettlemoyer2022) might boost the emergence of models trained in languages other than English, which can lead to more comprehensive and coherent text generation. An example is Sabiá (Pires et al. Reference Pires, Abonizio, Almeida and Nogueira2023), a model trained for Brazilian Portuguese that we intend to use in future work. We also recognize that humans (with or without CCN) can assess AAC solutions more accurately. However, doing so requires an AAC system that uses the proposed models for prediction, which is not addressed in this paper. We propose a method to train models to be plugged into end-to-end applications that consider the particular needs of each user or group of users.
In future work, we intend to evaluate the model’s prediction quality by testing it with AAC users’ parents and caregivers and then with people with CCN. Besides, we intend to implement a text expansion system for Brazilian Portuguese capable of expanding telegraphic sentences (e.g., eu comer bolo escola ontem, i.e., I eat cake school yesterday) into expanded natural language sentences (e.g., eu comi bolo na escola ontem, i.e., I ate a cake at school yesterday). The text expansion might help the interlocutor understand what the AAC user says.
Acknowledgment
This research was supported by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brazil (CAPES). Grant code: [88887.481522/2020-00]. The pictographic symbols used are the property of the Government of Aragón and have been created by Sergio Palao for ARASAAC (http://www.arasaac.org), which distributes them under Creative Commons License BY-NC-SA.