Martina Schrader-Kniffki, Carme Colominas, Kristina Bedijs, Paula Bouzas, Stefan Schneider and Daniel Kallweit
Abstract: This collective article assembles information about tertiary media corpora in the Romance languages. The authors present published and unpublished corpora, seminal works in Romance corpus linguistics and lacunae with regard to tertiary media. Whereas there already exist many corpus projects for the “big” Romance languages, there is still much work to do for the “smaller” ones. Recent developments in language processing technology promise to facilitate research in corpus linguistics in the coming years.
Keywords: cinema, corpus, corpus linguistics, internet, methodology, quantitative linguistics, radio, spoken language, television, tertiary media, written language
The following presentation of Portuguese language media corpora is based on a rather small assortment of accessible corpora and therefore comprises corpora of different sizes, quantitative and/or qualitative representativeness, different degrees of detail of annotation and electronic accessibility. It includes smaller corpora which can be consulted or even taken as a starting point for possible extension. These corpora are project-related corpora of raw data (cf. Beißwenger 2008) and are mostly individually built and accessible only directly from the respective author of the project. The corpora presented here fall into two main groups: (1) media corpora other than Internet corpora, such as telephone, radio and TV corpora, and (2) media corpora based on computer-mediated communication.
Within the field of study of spoken Portuguese, SPEECHDAT is a corpus which contains 5,000 telephone calls in Portuguese language (cf. INESC-ID n.d.a). Besides its web access version, it figures in a CD enclosed in Bacelar do Nascimento et al. (2005, 207) as part of a general corpus of contemporary Portuguese (Corpus de Referência do Português Contemporâneo, CRPC). Originally, the corpus was designed in the context of information technology and communication by INESC (Instituto de Engenharia de Sistemas e Computadores) in cooperation with Portugal Telecom as part of research into spoken Portuguese. While its original aim was to provide a database for voice driven teleservices, in the context of Portuguese corpus linguistics it is part of a database mainly for phonetic research. The linguistic data were gathered with the help of a prompt list (cf. INESC-ID n.d.b) containing requests for reading a selection of words, phrases, numbers, and spellings of words. The speakers were selected from among employees of Portugal Telecom, their friends and relatives according to standard demands of representativeness. Thus the selection of speakers fulfills the requisites of broad geographical and age coverage and gender distribution. The corpus is annotated and contains information about calling session, recording conditions, speaker sex, age and accent, signal file, recording date and time and phonemic transcriptions.
With the aim of designing a speech recognition program able to recognize particular information in multimedia news and match it with customer-area-of-interest profiles, a TV broadcast news corpus of European Portuguese was built within the ALERT project at the University of Duisburg (cf. Rigoll 2002). Rádio Televisão Portuguesa cooperated in this project, which is based on 300 hours of multimedia text selected out of 133 news programs. The corpus is processed by means of segmentation and a thesaurus of 22 top domains. Descriptions are found in online publications; however, no direct access is possible. Like SPEECHDAT, it is not a corpus designed for linguistic purposes, although it is certainly of interest for linguistic studies.
A radio as well as a TV corpus were generated in the context of the REDIP project (ILTEC n.d., cf. Ramilo/Freitas 2002), an ILTEC (Instituto de Linguística Teórica e Computacional) / CLUL (Centro de Linguística da Universidade de Lisboa) project on spoken European Portuguese. The corpus consists of 216,000 words taken from audio as well as video recordings from the areas of news, science, culture, sports, economy, and opinion with an orthographic transcription and information on date, duration, and name of the radio or TV program. Further information, including data on genre, text type, speakers, etc., is supplied. The corpus is processed with software tools such as Corlex, CONCOR and CONCOR.CB.
TV corpora designed directly for linguistic research are found in Barme (2002). Two TV corpora, one of them in European Portuguese (EP) and one in Brazilian Portuguese (BP), are published in the form of a book. The transcribed texts are based on so-called “Intimate Talk Shows” on Portuguese and Brazilian TV. They are characterized by the usage of language of proximity, as also indicated in the title of the book, and consist of 23,000 (EP) and 19,000 (BP) words respectively. A simple orthographic transcription system was used; no phonetic or prosodic indications and – besides the transcription head – no annotations are found. The corpus is accessible only in book form. TV corpora are also the basis of Bachmann (2010; 2011a; 2011b) within a research project on the usage of Portuguese language in Brazilian Television and its implications for language ideology. The author does not provide access, but provides a brief description of the corpus (cf. Bachmann 2011a, 53s.).
Given the social importance and influences of TV in Brazilian Portuguese, multimedia TV corpora for linguistic research on language use in news, talk shows and telenovelas are still a desideratum. As sociological and cultural studies about the Brazilian telenovela show (cf. Motter/Jakubaszko 2005; Martins 2008), multimedia data of this kind are lacking in order to analyze the fictitious presentation of social meaning of linguistic variation in Brazil and its impact on real language use and language attitudes.
The largest (virtual) corpus for Portuguese media linguistics is built out of all possible texts in Portuguese language that can be found on the World Wide Web (cf. Lemnitzer/Zinsmeister 2006, 41). Portuguese language is everywhere on the Web (cf. Branco et al. 2011), which permits the constitution of subcorpora on any kind of topic; though contexts for the metadata are largely absent. Corpus annotation will be possible only in a very reduced way, such as by providing data about language and speech community, date of text production, date of text finding, and possibly some information from the text producer.
As in many languages, social media are being used with growing frequency in Portuguese as well. There are smaller corpora on blog communication, such as the one studied in Sieberg (2005) as part of an international project on language use in Internet blogs (cf. Schlobinski/Siever 2005). The database for the study of linguistic features in Portuguese blogs is a corpus of 30 blogs found on Blogger.com.pt (blogspot.com), Weblog.com.pt, SAPO, and one individual blog which no longer even existed when the article was published (2005). The corpus is appended to the article and consists of a list of Internet addresses organized according to blog topic (cf. Sieberg 2005, 222ss.).
A small Internet corpus for a specific research project is found in Gutierrez Gonzalez (2007). With the aim to 1) analyze Internet-specific changes in Portuguese orthography and 2) analyze the specific frequency of certain lexical items in order to come to findings concerning an “Internet language”, a corpus consisting of 98 blogs, 135,000 tokens and 15,000 types was generated. A discussion about theoretical matters of corpus building is provided. Access to the corpus is provided by a list of links which constitute the corpus.
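The token and type figures reported for such a corpus can be related through the type-token ratio; 15,000 types over 135,000 tokens gives roughly 0.11. The following is a minimal sketch of how these counts might be obtained from raw text. The tokenization rule (lowercasing, letters and digits with internal hyphens or apostrophes) and the sample sentence are assumptions made for illustration, not the procedure used by Gutierrez Gonzalez (2007).

```python
import re
from collections import Counter


def token_type_counts(text: str) -> tuple[int, int, float]:
    """Return (number of tokens, number of types, type-token ratio)."""
    # Assumed tokenization: lowercase the text, then take runs of letters or
    # digits (accented letters included), allowing internal hyphens/apostrophes.
    tokens = re.findall(r"[^\W_]+(?:[-'][^\W_]+)*", text.lower())
    types = Counter(tokens)  # one entry per distinct word form
    n_tokens = len(tokens)
    n_types = len(types)
    ratio = n_types / n_tokens if n_tokens else 0.0
    return n_tokens, n_types, ratio


# Invented sample sentence, not taken from any of the corpora discussed.
sample = "Olá blog! O blog é novo, o blog é bom."
print(token_type_counts(sample))
```

Because the ratio falls as a corpus grows, it is only comparable between corpora of similar size.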
Generally, well-organized and annotated Portuguese corpora of, e. g., blogs, forum discussions or Twitter of any topic are a desideratum.
The corpora in this section are classified into two different subsections: one for spoken language and one for written language. Although a presentation according to the type of tertiary media would certainly be less idiosyncratic, it would have the disadvantage that some corpora have to appear in two or more types. The corpora presented in this section are quite different in many aspects: sources, extension, annotation, etc. But they are also quite different in their representation of the Catalan media language: some of them span different media, and some of them comprise basically traditional written texts.
All the corpora described in this section contain spoken language (spontaneous and non spontaneous). Some of them consist strictly of audio files, and others contain audio files aligned with orthographical transcriptions, and finally there are some corpora comprising orthographic, phonetic and morphological information, as well as linguistic and extralinguistic encoding.
The Corpus del Català Contemporani (CCCUB) is a large corpus-linguistic project started by the Department of Catalan Philology of Barcelona University in the early 1990s. The CCCUB project includes a total of seven subcorpora focusing on spoken language and ranging from a general corpus of dialect samples in different Catalan varieties to rather specific corpora documenting, e. g., news broadcasting speech (Corpus d’Informatius Orals, CIO) or radio advertisements (Corpus Oral de Publicitat, COP). Three of these seven subcorpora document primarily the Central Catalan variety spoken in the Barcelona metropolitan area and they focus on spontaneous or near spontaneous speech. This subgroup is made up of the Corpus Oral de Conversa Coŀloquial (COC); the Corpus de Varietats Socials (COS), an interview corpus based on a balanced sample of 78 informants selected according to basic linguistic variables; and the Corpus Oral de Registres (COR), a corpus of spoken interaction in a large variety of situational contexts and social domains. Although no definite figures are given, each of these spontaneous spoken language corpora seems to contain about 350,000 words, thus yielding a mid-size corpus of more than a million tokens (cf. Alturo/Boix/Perea 2002).
The Repertori Electrònic de Textos Orals Catalans (‘Electronic Repertoire of Catalan Oral Texts’, RETOC) is an initiative from the Language Engineering Research Group at the Institute for Applied Linguistics (Universitat Pompeu Fabra). It aims at providing several types of users with oral material and contains more than 14 hours of processed (already digitized) recordings from radio news (cf. De Yzaguirre/Farriols/Martí 2007).
The Clinical Interview Corpus (ClInt) is a bilingual Spanish-Catalan spoken corpus that contains 15 hours of recordings divided into 40 clinical interviews. It consists of audio files aligned with multiple-level transcriptions comprising orthographic, phonetic and morphological information, as well as linguistic and extralinguistic encoding. For further information on transcription and encoding guidelines cf. Vila Rigat et al. (2010).
The Arxiu audiovisual dels dialectes catalans de les Illes Balears, at the University of the Balearic Islands (cf. Corbera Pou 2004), features conversations between adult speakers in the Balearic Islands. In the future it will be open for consultation, subject to certain restrictions, by interested researchers (cf. Càtedra Alcover-Moll-Villangómez 2003).
The corpus LipTV: Llengua i Publicitat a la Televisió (cf. Pons i Griera et al. n.d.) consists of 1,000 television advertisements in Catalan (broadcast between 1991 and 2000), from which connections between language and non-linguistic information elements – both visual and auditory – present in advertising can be investigated.
The Corpus Glissando (cf. Garrido et al. 2013, cf. also Kallweit below) is a prosodic corpus which includes more than 20 hours of speech in Spanish and Catalan, recorded under optimal acoustic conditions, orthographically transcribed, phonetically aligned and annotated with prosodic information at both the phonetic and phonological levels. Glissando is actually made of two subcorpora: a corpus of read news and a corpus of dialogue material which is further subdivided into a subcorpus of informal conversations, and a set of three task-oriented dialogues, covering three different interaction situations. This corpus has been recorded in Catalan and Spanish by 28 speakers of each language, a collection of professionals and non-professionals.
Within the Web-as-corpus approach, two initiatives have produced corpora of Catalan: the Corpus d’Ús del Català a la Web (CUCWeb, cf. Boleda et al. 2006) and caWaC (cf. Ljubešić/Toral 2014).
CUCWeb is a 166-million-word corpus automatically compiled from the Web by the GLiCom Group of the Universitat Pompeu Fabra. This corpus has been automatically processed so that additional linguistic information is available (apart from the word forms). It is currently not accessible to the public.
The caWaC corpus is a web corpus of Catalan built from documents published on the .cat top-level domain. caWaC has been built with the Brno pipeline (cf. Suchomel/Pomikálek 2012) and, with 780 million tokens, is the largest existing corpus of Catalan. caWaC is not annotated and cannot be accessed through an interface, but it is freely available for download (cf. Ljubešić/Toral 2014).
The Wikicorpus is a trilingual corpus (Catalan, Spanish, English) that contains large portions of the Wikipedia (based on a 2006 dump) and has been automatically enriched with linguistic information. In its present version, it contains over 750 million words (for further details cf. Reese et al. 2010; download of raw text and tagged version via Reese et al. n.d.).
The Corpus Textual Informatitzat de la Llengua Catalana (CTILC) is the oldest written corpus for Catalan. The process of its compilation at the Institut d’Estudis Catalans began in 1985 and was completed in 1997. It contains 3,399 literary and non-literary written texts, produced between 1832 and 1988 (cf. Rafel 1994). While this corpus was created with exclusively lexicographic aims, it constitutes an essential source of data for the study of written Catalan genres during the 19th and 20th centuries. The corpus contains 52 million words, 44% of which correspond to literary text and 56% to non-literary text. These two groups are divided into several subgroups; the literary one is divided into genres and the non-literary one into thematic domains like philosophy, religion, social sciences, the press, pure and natural sciences, art, etc. Access and search queries are possible through the CTILC web interface (cf. IEC n.d.).
With respect to specialized language, the most relevant corpus for Catalan is the Corpus Textual Especialitzat Plurilingüe (IULACT) at the Universitat Pompeu Fabra (cf. Cabré/Bach 2004) which contains written texts in five different languages (Catalan, Spanish, English, French and German). The areas of interest include: economics, law, computer science, medicine, environmental science, and linguistic sciences. Texts are selected and classified according to topics proposed by specialists in each area (Law, Economics, Environmental science, Medicine, Computer Science and Linguistics). The texts are tagged according to the standard SGML, following the guidelines proposed by the “Corpus Encoding Standard (CES)”. IULACT can be accessed online (cf. IULA n.d.b).
AnCora consists of a Catalan corpus (AnCora-CA) and a Spanish corpus (AnCora-ES), each comprising 500,000 words. The AnCora corpus is mainly based on journalistic texts. The corpora are annotated at different levels: lemma and part of speech, syntactic constituents and functions, argument structure and thematic roles, semantic classes of the verb, etc. AnCora can be queried online (cf. CLiC n.d.a).
The corpora of Catalan currently available cover an ample portion of the tertiary media spectrum, but especially the traditional media, like telephone, radio and television. Internet-based media like blogs, chats, newsgroups, SMS and e-mails are scarcely represented in the overview above. The biggest corpora recently created in the wake of the Web-as-corpus tendency (CUCWeb and caWaC) include ample portions of traditional written texts. However, in the academic community there are many internal projects, master’s theses and other initiatives which aim to study the characteristics of the Catalan media language. This suggests that the Catalan corpus linguistics community will soon be able to fill parts of the gap.
Research on French language in tertiary media is complicated by the fact that there are many collections of audiovisual material available in libraries and other holdings, but most of them are not yet processed in any way for linguistic analyses. There are only a few transcriptions or annotated corpora, and many projects seem to have been abandoned over time. The corpus Français parlé des médias, built by the homonymous Swedish research group (cf. Forsgren et al. n.d.), is such an example: The initial goal was to create a corpus from about 50–70 hours of French television broadcasts, containing the digitized audiovisual data and basic orthographic transcriptions. Intended to serve as a database for various linguistic studies (of which several have indeed been carried out and published), the corpus could not be finalized and made publicly available since the work on transcriptions turned out to be too time- and resource-intensive.
Via the CLAPI corpus database, TalkBank offers a 30-minute phone conversation between friends in Canadian French, recorded in 2004, and completely transcribed following the conventions of the corpus research group ICOR (Interactions CORpus), with free access to the audio file, CLAN and TEI files (cf. Laboratoire ICAR 2004).
Also available on the CLAPI platform are 8 online video phone calls (split into 13 files) recorded in 2007–2008, featuring conversations between two persons via MSN on a determined subject. The totality of the conversations lasts nearly 5 hours; each audiovisual file can be downloaded separately. There is only one transcription for a 14-minute conversation available, though (cf. Laboratoire ICAR 2007/2008).
ASILA offers two corpora of telephone conversations, one between students and advisors in an academic Centre d’Information et Orientation (corpus CIO) and one between clients and consultants of the railway company (corpus SNCF). There is no further information available on these corpora, and the website is currently out of order (cf. Universiteit Gent n.d.).
Within the corpora of the project Traitement de Corpus Oraux en Français (TCOF), two private phone conversations – one of five minutes, the other of ten minutes – are available as audio file and transcription (cf. André 2007; André n.d.).
The international project sms4science collected short messages sent via an online interface. The Belgian subcorpus is available on CD-ROM and contains 30,000 SMS in French (cf. Klein/Paumier/Fairon 2006). Another subcorpus, 88milSMS (cf. Panckhurst et al. 2014), has been established in Southern France, comprises 88,000 SMS and is accessible online (on registration); the documentation can be found in Panckhurst et al. (2013). The subcorpora covering La Réunion – smslareunion, Ledegen (2014), comprising over 12,000 SMS – and the Alp departments – smsalpes, Antoniadis (2014), comprising about 22,000 SMS – can be downloaded with documentation from the CoMeRe repository. The Swiss subcorpus comprises 4,619 SMS written in Standard French and 30 in “French Patois”; it can be used for research upon registration (cf. Stark/Ueberwasser/Ruef 2009–2014). The subcorpus from Québec is in the process of normalization and has not yet been published. Cougnon/Fairon (2012) assemble papers dealing with the building process of SMS corpora, methods and project outcomes. A new approach to the language of short messages is the Swiss project What’s up, Switzerland?, in which a collection of WhatsApp messages is currently in the process of corpus building (cf. Stark/Dürscheid/Meisner 2014).
The Institut National de l’Audiovisuel (INA) is a public entity in charge of the collection of audiovisual publications of French radio and television, providing the complete dépôt légal and more resources. By 2017, INA’s stock amounted to 14.7 million hours of image and sound. Researchers can consult them classified by type of media (TV, radio, cinema, Internet, theatre, and opera) in the eight INAthèque centers; many resources are also available online. The department for audiovisuals of the Bibliothèque Nationale de France provides a collection of all movies available on video tape since 1975. The official stock for French cinema is the Paris-based Cinémathèque Française, an archive for French movies and everything related to them from posters to media coverage.
The European Language Resources Association (ELRA) provides several French media corpora, all of them available on license acquisition. The ESTER corpus (cf. ELRA 2014b), produced in 2007 by a speech recognition research group, contains about 100 hours of orthographically transcribed radio broadcast news, partly annotated, mainly taken from France Inter, France Info, Radio France Internationale and Radio Télévision Marocaine. A further 1,700 hours of news are available within the EPAC corpus (2010, cf. ELRA 2014a), 100 hours of which were manually transcribed. ESTER 2 (2012, cf. ELRA 2014c), based on ESTER, additionally contains about 6 hours of transcribed African radio broadcasts. As a follow-up to the ESTER projects, the ETAPE corpus is a mix of 13.5 hours of radio data and 29 hours of TV data. It includes mostly non-planned speech and multiple speaker settings recorded from news, debates and entertainment shows (cf. Gravier et al. 2012). It will be made available on license by the Evaluations and Language resources Distribution Agency (ELDA) after the completion of annotations.
Within the project TCOF (Traitement de Corpus Oraux en Français) there is a radio interview between a radio presenter and an actor. The recording is about 27 minutes long, the audio file and transcription are available online (cf. André 2006).
Another corpus containing radio recordings is C-PROM, elaborated by a research group from four universities (cf. Avanzi et al. 2010; Simon et al. 2010). In total, this corpus comprises about 70 minutes of speech, of which about 20 minutes come from radio news and interviews. The aligned and annotated corpus can be downloaded with the corresponding material from the project website after prior registration. Even though the radio portion of this corpus is not large, the possibility to compare it directly with other spoken genres within the same corpus is a great advantage.
One of the very rare corpora of audiovisual French is the Corpus transcrit de quelques journaux télévisés français by Lindqvist (2001), who provides the transcription of 19 news broadcasts from the channels TF1, France 2, France 3 and TV5, all from 1993. The book is accompanied by an MP3 disk with the sound files. Lindqvist was originally concerned with phonetic norm, the use of schwa, and the realization of liaison and the negation particle ‘ne’. The transcription follows in general the rules established by the GARS (Groupe Aixois de Recherches en Syntaxe), allowing thus for rather detailed analyses – all to be conducted manually, since the transcripts are not available digitally.
A project focused on person recognition technologies, but nevertheless usable for linguistic analyses, is the REPERE corpus, which is supposed to contain 60 hours of videos with multimodal annotations (speech transcription and video annotation, cf. Giraudel et al. 2012). REPERE will be distributed on license by ELDA.
In spite of the apparent ease of building corpora from online data, only very few are available in French so far. As Falaise (2005) explains, collecting online data entails ethical questions of privacy and authorship which are more difficult to resolve than in the case of other speech data. The special challenges in the field of web corpus building are also the subject of several papers assembled in Calabrese (2011).
The WaCky (Web as Corpus) project has built a corpus of around 1.6 billion words obtained by crawling and post-processing Web data from .fr domains (frWaC). The annotated version contains POS and lemma information. The online interface provides a number of options for linguistic search queries.
Falaise’s Corpus de français tchaté, collected in 2004, comprises 23 million words in 4 million turns. No thematic restrictions were made; the corpus contains conversations on all kinds of topics and displays all kinds of pragmatic behavior. The complete corpus is available online (cf. Falaise 2014).
The research group Humanités numériques et data journalisme: le cas du lexique politique elaborated a corpus of political Twitter status updates, Polititweets. The corpus comprises more than 34,000 tweets from 205 politicians, most of them sent during the municipal elections 2014 in France. The corpus is divided into seven TEI folders (one for each political group), available for download (cf. Longhi et al. 2014).
Many corpora of French media have been established for linguistic research but never published, such as Labeau’s corpus of French television news (2007), the TV debates used by Sullet-Nylander/Roitman (2010) and those used by Sandré (2010), Bedijs’s corpus of 24 French youth movies (2012), Abecassis’s corpus of five 1930s French movies (2005), Isosävi’s corpus of 34 recent French movies (2010), Wenz’s corpus of weblogs (2017), Atifi’s corpus of 200 Moroccan message board entries (2007), the message board corpus analyzed by Marcoccia/Gauducheau (2007), the chat and message board corpus used by van Compernolle/Williams (2007), and the chat and weblog corpora analyzed by Lorenz/Michot (2012; 2014). This seems to be the case for most individual work for doctoral theses and projects of smaller extent. Several reasons come into play: establishing a corpus from tertiary media requires great effort and time, both factors increasing when the corpus is destined to be published. It must then be coded along a standard that is not necessarily used by the original researcher and stored in a format which makes it available to other users. This work can hardly be expected of a single researcher whose focus and competence is linguistics and not computer science or computational speech processing. Furthermore, researchers often use copyrighted material for their analyses, and it would be problematic to make such material publicly accessible. This also holds for data involving privacy matters, such as data extracted from phone calls, chats, social media, etc. Anonymization ex post is virtually impossible for online data which can be tracked back for years. This is probably the reason why we still lack corpus studies on massively used services such as Skype, WhatsApp (now in process) and Facebook, where the individuals would be identified too easily. Yet semi-anonymous platforms like Twitter and open online message boards have not been the object of systematic corpus building so far either.
Besides the fact that there are very few reference corpora, and (almost) each study is carried out on a newly established data set, we can locate further lacunae in the field of media corpora. At present, there are no corpora touching on videogames, a subject that would be highly interesting in terms of human-computer-interaction. There are nowadays some videogames available requiring (spoken or typed) speech activities from the player. So far, no such interactive gaming session has been recorded for linguistic analysis.
Another gap to fill is the representation of Francophonie in media corpora. So far, only the “big” and wealthy Francophone countries have actively taken part in research projects on media linguistics, i.e., France, Switzerland, Belgium and French-speaking Canada. Most other countries and regions are heavily underrepresented: there are almost no published corpora (apart from smslareunion), and only some corpus studies on selective topics have been carried out. Yet it would be of great interest to compare data between the different regions in order to learn about variational differences in the use of media.
The aim of this contribution is to describe and evaluate the existing tertiary media corpora available for research on Galician. So far, only one corpus created, strictly speaking, as a tertiary media corpus is available for this field: the VEIGA corpus (cf. Sotelo Dios n.d.). In addition to this, another corpus can be accessed, CORGA (cf. Rojo et al. n.d.), which is the most representative corpus of present-day Galician. It contains press texts in their original digital format. Parallel to these corpora there are numerous audiovisual archives in Galician, although they do not offer the possibility of selecting and saving linguistic forms. Nevertheless, they are a solid foundation on which to construct a complete audiovisual media corpus in the future.
The VEIGA corpus is a kind of subcorpus belonging to the Corpus Lingüístico da Universidade de Vigo or CLUVI (cf. SLI 2003). CLUVI is a collection of parallel text corpora open to the public. Its goal is to help analyze the relationship between Galician and other languages (e. g., English, Spanish etc.) in different fields and to reach both academics and translation practitioners (cf. Gómez Guinovart/Sacau Fontenla 2007, 855). In the field of tertiary media corpora, the VEIGA project for subtitling is of special relevance. This corpus consists of English-language films subtitled into both English (intralingual subtitling) and Galician (interlingual subtitling). Words and expressions can be looked up both in Galician as well as in English. The outcome of this search is shown together with the corresponding Galician translation (if the search is done in English, cf. Tab. 1) or with the corresponding form in the original language (if the search is done in Galician, cf. Tab. 2).
The film’s code and the unit for the corresponding translation are shown on the left. More precise information (about the film, the whole unit number for the translation or the person who is responsible for alignment) may be obtained by clicking on the corresponding cell. An arrow on the right makes it possible to save the linguistic context in which the said form appears.
Table 1: Search results for the English word “husband”
Film code (unit) | English | Galician
1-AFT (203) | [[s n="129" d="00:15:42,903" a="00:15:46,020"]]I want to make love with you right[[l/]]now because you’re my husband... | [[s n="149" d="00:15:42,703" a="00:15:44,614"]]Quero face-lo amor contigo[[l/]]agora mesmo [[s n="150" d="00:15:44,783" a="00:15:48,492"]] porque e-lo meu marido
2-AFT (736) | [[s n="483" d="00:54:27,823" a="00:54:30,257"]]My husband won’t have sex with me,[[l/]]either. | [[s n="544" d="00:54:27,183" a="00:54:30,493"]]O meu marido tampouco quere[[l/]]manter relacións comigo.
3-AFT (799) | [[s n="528" d="01:00:04,743" a="01:00:08,622"]]What is your husband doing right now? | [[s n="600" d="01:00:06,623" a="01:00:08,978"]]¿E que anda a facer agora[[l/]]seu marido?
Table 2: Search results for the Galician word “abofé”
Film code (unit) | English | Galician
1-PUN (340) | [[s n="314" d="00:19:23,390" a="00:19:26,507"]]I think he’s in[[l/]]for some real bad stuff. | [[s n="309" d="00:19:23,129" a="00:19:26,228"]]Abofé que algo[[l/]]fixo mao.
2-PUN (422) | [[s n="384" d="00:22:53,750" a="00:22:55,502"]]- Certainly. [[l/]]- You do. | [[s n="380" d="00:22:53,470" a="00:22:55,069"]]- Abofé.
3-PUN (791) | [[l/]]You’re a traitor. | [[l/]]É un traidor. Abofé.
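The segment markup visible in Tables 1 and 2 lends itself to automatic processing. The following is a minimal sketch of how a single table cell might be split into its subtitle segments. It assumes that `n` is the subtitle number, that `d` and `a` are the start and end times of display, and that `[[l/]]` marks a line break within a subtitle; none of these readings is confirmed by the corpus documentation cited here, and the function names are hypothetical.

```python
import re

# Segment tag as printed in the tables: [[s n="…" d="…" a="…"]]
SEGMENT = re.compile(r'\[\[s n="(\d+)" d="([\d:,]+)" a="([\d:,]+)"\]\]')


def to_seconds(stamp: str) -> float:
    """Convert an 'HH:MM:SS,mmm' time stamp to seconds."""
    hms, millis = stamp.split(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(millis) / 1000


def parse_cell(cell: str) -> list[dict]:
    """Split one table cell into a list of subtitle segments."""
    matches = list(SEGMENT.finditer(cell))
    segments = []
    for i, match in enumerate(matches):
        # The text of a segment runs up to the next tag or the end of the cell.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(cell)
        text = cell[match.end():end].replace("[[l/]]", " ").strip()
        segments.append({
            "n": int(match.group(1)),
            "d": to_seconds(match.group(2)),  # assumed: start of display
            "a": to_seconds(match.group(3)),  # assumed: end of display
            "text": text,
        })
    return segments


# Galician cell of the first result in Table 1 (two segments in one unit).
cell = ('[[s n="149" d="00:15:42,703" a="00:15:44,614"]]Quero face-lo amor '
        'contigo[[l/]]agora mesmo [[s n="150" d="00:15:44,783" '
        'a="00:15:48,492"]] porque e-lo meu marido')
for segment in parse_cell(cell):
    print(segment["n"], segment["text"])
```

Cells without a segment tag, such as the third result in Table 2, yield an empty list under this sketch and would need separate handling.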
CORGA (Corpus de Referencia do Galego Actual, cf. Rojo et al. n.d.) is a documented corpus including texts from the digital press. In January 2013 it gathered 25.8 million forms. The corpus includes different text types (newspapers, magazines, essays and literary fiction like novels and theatre) from 1975 until the present. Its codification is based on XML.
Words and expressions can be searched by text type, period and topical area, or by any combination of these parameters. CORGA’s new version also allows searching for a work or an author and reports how many words and documents match the search and how many words CORGA contains in each category.
Texts originally in digital format are, specifically, O Correo Galego (from 1999 to 2002), Galicia Hoxe (from 2005 to 2009), De Luns a Venres (from 2007 to 2009), O Xornal de Galicia (2010), A Nosa Terra (from 2002 to 2010) and the magazines Díxitos, Consumer, Código Cero and Revista Galega de Economía. One recent work which specifically describes this corpus is Domínguez Noya (2008).
The creation and modification of audiovisual archives is in a constant process of development. In the following paragraphs, some of the most important current archives are described so as to offer a formally and topically wide spectrum.
Using the large multimedia archive produced by the Consello da Cultura Galega (cf. Consello da Cultura Galega n.d.b), the public can access an important number of audiovisual and sound documents such as lectures, discussions or interviews. Searches can be made by topic, time or authorship.
CanleTV is an online platform of the Axencia Galega das Industrias Culturais (Agadic), an institution belonging to the Consellería de Cultura, Educación e Ordenación Universitaria and responsible for supporting and spreading Galician culture and its audiovisual products. Although it has mainly cultural and dissemination tasks, this platform also allows for free access to a huge collection of films. The videos are classified in six categories (fiction, animation, documentary, experimental, videoclips and trailers). Subsections and tags are found for each of them according to genre or topic.
The AVG Soportal Audiovisual Galego (cf. Consello da Cultura Galega n.d.a) is an additional support platform for cultural audiovisual products in Galician. It offers not only information and the possibility to broadcast works, but also the possibility for the public to access these products. This section of extras currently includes 1,642 documents (inter alia 825 productions and 11 interviews).
Finally, the project Gzvideos is based on a collection of more than 1,000 videos addressing Galician social issues. The videos are selected according to topic categories.
In a nutshell, these audiovisual archives contain a wide linguistic spectrum ranging from oral to written sources, to different registers and different communicative situations. All in all, they are a valuable source for the study of the Galician language. However, at this moment, there is no single corpus that allows for the selection, characterization and registration of the different linguistic forms from these materials for research.
The corpora in this section are listed and described in alphabetical order. Although a presentation according to the type of tertiary media would certainly be less idiosyncratic, it would have the disadvantage that some corpora would have to appear under two or more types. Several of the corpora described in this section span different media (primary, secondary and tertiary) as well as different kinds of tertiary media. This is true especially for some of the widely-known reference corpora of spoken Italian that include, besides various kinds of face-to-face speech, recordings of telephone conversations and of radio and television programs (cf. Barbera 2013b and 2013c; Cresti/Panunzi 2013).
The Corpora e lessici dell’italiano parlato e scritto (CLIPS) contain approximately 100 hours of spoken Italian. The corpus was collected between 2000 and 2004 in 15 major Italian cities and from national radio and television stations. Besides texts read aloud and dialogue elicited by map tasks, it comprises 16 hours from radio and television interaction of various types and 16 hours of telephone speech (collected in a Wizard-of-Oz setting or from answering machines). Approximately 30% of the recordings have been transcribed. The recordings and transcripts are available online (cf. Albano Leoni n.d.).
The 58,300-word Corpus di italiano parlato was collected in the period from 1973 to 1998 and includes a few texts (covering about 50 minutes) deriving from telephone conversations and television programs. The orthographic and prosodically annotated transcripts are published in printed form and on a CD comprising the audio files (cf. Cresti 2000).
The Corpus di parlato cinematografico comprises audio recordings and orthographic transcriptions from the first Italian sound film (“La canzone dell’amore”, 1930) and from three newer films (“Quattro passi fra le nuvole”, 1942; “Era di venerdì 17”, 1957; “Il profumo del mosto selvatico” [original title “A walk in the clouds”], 1995). The last two are remakes of “Quattro passi fra le nuvole”. The film “La canzone dell’amore” has been transcribed entirely. Concerning the other three films, the transcripts are limited to four sequences with the same settings, persons, places and discourse subjects. The audio files and transcripts can be downloaded online (cf. Giannini/Pettorino/Vitagliano n.d.a).
The Corpus di parlato telegiornalistico. Anni Sessanta vs. 2005 (CPT) contains approximately 116 minutes of RAI newscasts: four newscasts recorded between 1966 and 1969 and a fictitious newscast recorded in 2005. The recordings and the transcripts are available for download (cf. Giannini/Pettorino/Vitagliano n.d.b).
The Integrated reference corpora for spoken romance languages (C-ORAL-ROM) are a collection of four subcorpora, French, Italian, Portuguese and Spanish, that are comparable in structure and size (300,000 words for each language). Like the other subcorpora, the Italian part contains, besides different types of face-to-face speech, recordings from telephone (25,000 words, mainly from Florence and Tuscany) and from broadcast speech (60,000 words, radio and television). Most of the texts were recorded between 2000 and 2002. A compressed and encrypted version of the collection is published on a DVD attached to the volume presenting the corpus (cf. Cresti/Moneglia 2005). It contains the audio files aligned with the transcripts as well as software for the acoustic and linguistic analysis of the corpus. The whole corpus is part-of-speech-tagged and lemmatized. A non-compressed and non-encrypted version of the corpus can be purchased at the Evaluations and Language resources Distribution Agency (ELDA).
The Italian part of the corpora created within the scope of the Web-As-Corpus Kool Yinitiative, the itWaC corpus, is the largest publicly-documented language resource of Italian. The 1.5-billion-word collection was constructed from the web by limiting the crawl to the .it domain and using medium-frequency words from the Repubblica corpus and basic Italian vocabulary lists as seeds. From the description in Baroni et al. (2009) it is not clear, however, during which time frame the web crawl took place. The part-of-speech-tagged and lemmatized corpus can be queried and downloaded (cf. Baroni et al. n.d.a).
The 490,000-word corpus of the Lessico di frequenza dell’italiano parlato (LIP), the oldest reference corpus of spoken Italian, was collected in 1990–1992 in four cities: Milan, Florence, Rome and Naples. It contains amongst others 10 hours from various national and local radio and television programs and 10 hours of spontaneous telephone conversations. Its 469 texts have been transcribed completely and published together with the frequency dictionary (cf. De Mauro et al. 1993). A revised, part-of-speech-tagged and lemmatized version of the transcripts is accessible online (cf. Bellini/Schneider 2006; Schneider/Bellini n.d.); a newly revised, part-of-speech-tagged and lemmatized version, called VoLIP, can be queried together with the audio files online (cf. Voghera et al. n.d.).
The corpus of the Lessico dell’italiano radiofonico (LIR) consists of two subcorpora, LIR1, collected in 1995, and LIR2, collected in 2003. LIR1 comprises 64 hours of transcribed recordings from 9 national radio programs, LIR2 contains 36 hours of transcribed recordings from the three national RAI programs. The recordings and transcripts of the two corpora are available on DVD (cf. Maraschio/Stefanelli 2003) and also online (cf. MICC n.d.b).
The corpus of the Lessico italiano televisivo (LIT or LIT 2006) is a collection of 168 hours of television, sampled at random according to a predetermined timetable during the year 2006 from RAI and Mediaset evening programs. The transcripts and aligned recordings of the entire collection can be queried online (cf. Il portale dell’italiano televisivo 2013 or MICC n.d.c).
The 25-million-word Perugia corpus (PEC), assembled by Spina, has a television section, a film section and a web section. The part-of-speech-tagged and lemmatized corpus can be queried online on registration (cf. Spina 2015).
Although newsgroups operate within a distinct system (Usenet system), they are functionally similar to discussion forums on the web. The Newsgroup UseNet Corpora (NUNC) are a suite of corpora of several languages collected by Barbera starting from 2002 (cf. Algozino 2011; Allora 2011; Cignetti 2011; Barbera 2013a). At present, the Italian part (NUNC-IT) contains 280 million words on general issues as well as on the subjects of cooking, motoring, photography and cinema. The part-of-speech-tagged and lemmatized corpus can be queried online (cf. Barbera n.d.).
The most significant achievement of the Piattaforma per l’apprendimento dell’italiano su corpora annotati (PAISÀ) was the creation of a corpus for pedagogical purposes. The 250-million-word corpus was constructed entirely from the web in September and October 2010 (cf. Borghetti/Castagnoli/Brunello 2011). The majority of its texts stem from sites of the Wikimedia Foundation (Wikipedia, Wikinews, Wikisource, Wikibooks, Wikiversity, Wikivoyage). The part-of-speech-tagged and lemmatized corpus can either be downloaded or directly queried online (cf. Baroni et al. n.d.b).
A number of telephone text messages (SMS) can be queried at the SMS Monitor Studies site (cf. Allora n.d.). At present, the corpus contains only 686 messages in Italian (which means it is rather unrepresentative), but according to its editor, Adriano Allora, it is growing with the help of volunteers entering messages. Compagnone (2011), a comparative analysis of Italian and French SMS communication, is based on the corpus.
Twitter is an online service enabling subscribers to post short messages called tweets of up to 140 characters about any kind of subject. The TWITA collection contains 155 million tweets in Italian, assembled via an automatic procedure between February 2012 and June 2013 (cf. Basile/Nissim 2013). The collection can be downloaded as lists of tweet identifiers (cf. Basile n.d.). These lists are sufficient to recreate the collection (except for tweets that have since been deleted).
Undoubtedly, a lot of data concerning tertiary media language remains unpublished, mostly due to copyright issues. The LABLITA Laboratory at the University of Florence hosts a considerable amount of data documenting spoken language, including Italian spoken on telephone, radio and television. Some of it has been published and is mentioned above, e. g., the Corpus di italiano parlato and the C-ORAL-ROM corpus, but a large part of its materials can be accessed only after signing a license agreement. Sciubba (2010) bases her research on an unpublished corpus of 421 e-mails between university students and tutors collected between November 2002 and May 2005. A part-of-speech-tagged corpus of Italian tweets collected by Spina between November 2012 and May 2013 is not publicly available.
Regarding the ongoing corpus projects, there are two that concern Italian spoken on television and one regarding social media. The Corpus di interpretazione televisiva (CorIT) currently comprises approx. 2,700 recordings of simultaneously or consecutively interpreted programs such as talk shows and US and French presidential debates (cf. Falbo 2012). The corpus of the Lessico dell’italiano televisivo (DIA-LIT) aims at extending and completing the data of the LIT corpus, by taking into account the entire history of Italian television starting from 1954. About 40 hours of recordings can already be queried online (cf. MICC n.d.a). The SentiTUT corpus (cf. Patti et al. 2013) currently comprises around 4,400 tweets specifically collected for the analysis of irony in 2011 and 2012 (cf. Bosco/Patti/Bolioli 2013). It will be made available for download in the form allowed by Twitter policies.
The corpora of Italian currently available cover an ample portion of the tertiary media spectrum, from traditional ones, like telephone, radio and television, to most types of Internet-based media, like blogs, chats, newsgroups, and others. However, there are two channels that play a key role in contemporary media interaction, but that are scarcely represented in the overview above: telephone text messages (SMS) and emails. In the first case, there is an individual initiative going on, without satisfactory results as yet; in the second case, we are aware of only one unpublished corpus. In both cases, there are presently no plans or projects in the Italian corpus linguistics community aimed at filling the gap.
From a purely quantitative viewpoint, it might even be true that Italian media language is better represented than any other of its diamesic varieties. We should bear in mind, however, that the huge corpora recently created in the wake of the web-as-corpus tendency (PAISÀ and itWaC) comprise ample portions of traditional written texts. This fact casts doubt on the notion of tertiary media and calls for a distinction between the Internet as mere support of traditional communication on the one hand and as a device promoting new forms of linguistic interaction (for a typology cf. Allora 2009) on the other.
Since Spanish is spoken in more than 20 countries distributed on five continents, this section is not meant to be an exhaustive presentation of all the corpora of the Spanish language in the tertiary media. We rather intend to give an overview of the most important corpora of Spanish as a media language compiled both in Spain and in several Latin-American countries. Those corpora will be described in the following order: First, the published corpora will be listed and analyzed; we will distinguish between corpora that include texts from several tertiary media, corpora of texts from radio and/or television, corpora of transcribed telephone conversations and corpora of web texts. In the remainder of this section, unpublished corpora and corpora that are still in the stage of compilation will be described. In every subsection, the corpora will be listed in alphabetical order.
As we will show, most of the described corpora include a variety of different media types and in only very few cases is it possible to separately access the subcorpora that originate from tertiary media.
The Institut Universitari de Lingüística Aplicada of the Universitat Pompeu Fabra in Barcelona is managing a data bank of neologisms, which is called BOBNEO, Banco de Neologismos. Two of the subcorpora have to be mentioned in the context of the present paper, namely the oral one as well as the subcorpus of spontaneously-written texts. The former comprises the transcriptions of radio broadcast shows, whereas the texts of Internet pages are included in the latter. The aim of the BOBNEO project is to trace neologisms that are coined in the Media and by the Media, i.e., the project is primarily a lexicographical one. Besides the language of the Spanish media, it contains language materials from nine Latin-American countries as well as Catalan data. The texts have been collected since 1988, but it is only since 2000 and 2010 that the data concerning the spontaneously-written texts and the radio have been added. As the study of neologisms is a highly dynamic field, the exact size of neither the entire corpus nor the several subcorpora is indicated. Accessing the databank via a guest-log-in (cf. IULA n.d.a) brings some smaller restrictions, but becoming a registered member – which is free – gives full access to the entire data, which is annotated regarding POS as well as word formation and typographical information. The data from 2004 to 2010 can be searched alternatively via the homepage of the Centro Virtual Cervantes (cf. CVC n.d.).
The Corpus de Referencia del Español Actual (CREA), compiled by the Real Academia Española, has to be considered one of the reference corpora of the Spanish language, comprising 160 million words. It can be queried online (cf. Real Academia Española n.d.). The varieties of the Spanish-speaking countries are represented quite unequally (the ratio of the European to the Latin-American varieties is 50:50), and the main part of the corpus material (ca. 90%) stems from written sources, including the secondary media. The category “Oral” combines texts from primary media with texts from tertiary media and does not adequately represent the text types arising in the context of the New Media. The core of the oral subcorpus comes from the “classical” tertiary media radio and television; the data was contributed in large part by other projects, e. g., the CORLEC or the DIES-RTV project (cf. below). The exact composition of this subcorpus is still to be published, but there is a list of the corpora of spoken Spanish that were used at Real Academia Española (2017). CREA also incorporates transcripts of telephone conversations.
The subcorpus of the 20th century of the Corpus del Español compiled by Mark Davies – which can be consulted online (cf. Davies 2016) – contains a considerable portion of texts from the tertiary media. One reason for this is that Davies used parts of CORLEC, which consists to a great extent of texts taken from radio and TV, as we shall describe below. Besides that, Davies integrates many interviews with Mexican politicians, which had been published as audio files with the respective orthographic transcriptions on the websites of the political parties PRI (Partido Revolucionario Institucional) and PAN (Partido Acción Nacional). The user benefits primarily from the search interface of the corpus, but not so much from the representation of new corpus material.
Within the project Corpus del Habla en Almería, run by the ILSE research group at the Universidad de Almería, the oral discourse of this Spanish city is analyzed. Since the researchers want to give a representation of their object that is as exhaustive as possible, they also included recordings of TV and radio shows as well as of telephone conversations. At the moment, only the transcriptions of the 108 semi-directed interviews and the respective metadata are available online (cf. ILSE 2005), but it is planned to make the rest of the corpus available and searchable in the foreseeable future.
In contrast to this, the Corpus Oral de Referencia de la Lengua Española Contemporánea (CORLEC) is available as a whole online (cf. Laboratorio de Lingüística Informática n.d.). This macro-corpus of the spoken variety of European, i.e., peninsular Spanish, consists of 1,100,000 words that were recorded between 1990 and 1992, transcribed orthographically and annotated prosodically. Almost half of the texts (46.4%) stem from tertiary media (journalistic subcorpus), but all the other subcorpora also include texts from radio and/or television; furthermore, the category “conversations” includes telephone conversations. The corpus and the transcriptions can be downloaded from the website of the Laboratorio de Lingüística Informática at the Universidad Autónoma de Madrid mentioned above.
The Spanish subcorpus of the Integrated reference corpora for spoken romance languages (C-ORAL-ROM) comprises about 300,000 words and was compiled from 2001 to 2004 at the same Laboratorio de Lingüística Informática in Madrid. Like the subcorpora of French, Italian and Portuguese, it consists not only of recordings of face-to-face conversations, but also includes recordings of (informal) telephone conversations and of several text types from the tertiary media radio and TV, like news, reports, interviews, talk shows etc. The audio files aligned with the transcriptions, which were lemmatized and annotated prosodically and morphologically, are available through the publication of the book and the DVD by Cresti/Moneglia (2005).
In contrast to the corpora described before, the Corpus Glissando (cf. also Colominas above) contains only texts from tertiary media, more precisely from the radio. This bilingual (parallel) corpus (Spanish and Catalan) was compiled in a cooperation of the Universitat Autònoma de Barcelona, the Universitat Pompeu Fabra (also Barcelona) and the Universidad de Valladolid. It consists of three subcorpora: the radio news corpus, which includes real news texts of Cadena SER that were recorded with professional newsreaders, one corpus of task-oriented dialogues, and one corpus of free, spontaneous conversations. The language material amounts to more than 20 hours of speech that were aligned with the orthographic transcription and provided with prosodic and phonetic information. After registration, the corpus can be accessed online (cf. Aguilar/Garrido Almiñana n.d.).
The CoMIT (Corpus Multimodal de Informativos Televisados) was created at the Centre de Llenguatge i Computació at the Universitat de Barcelona and contains the transcriptions of about 6 hours of Spanish newscasts (almost 100,000 words). Although it is a relatively small corpus, it contains the language material of nine national news broadcasts transmitted in 2002 by Televisión Española (TVE 1 and La 2) and Antena 3. It can be used not only for linguistic questions, but also for media-theoretical analysis of the interaction between image, sound and speech within the genre of newscast. Registration is required to use the corpus (cf. CLiC n.d.b).
A corpus which contains only Latin American Spanish is El Grial, a project of the Pontificia Universidad Católica de Valparaíso in Chile under the aegis of Giovanni Parodi. It combines an interface for annotating texts with an interface for consulting several corpora compiled in the frame of the project. Thanks to the possibility of selecting certain subcorpora, one is able to access only the language material from the tertiary media, like the subcorpus NOTICENTV. Originally collected in 2000, but constantly expanded until 2011, it contains the transcription of the newscasts from four Chilean TV stations that are lemmatized and multi-level annotated, i.e., tagged morphologically, syntactically and functionally. The subcorpus for the year 2000 contains about 85,000 words. El Grial can be consulted freely online (cf. Parodi n.d.).
The 1997 Spanish Broadcast News Transcripts were collected in the frame of the Hub4 project as part of a training set for speech recognition software. The corpus contains the speech data of 30 hours of broadcast news as well as the aligned transcriptions that have further annotations, most of which are not relevant for “normal” linguistic purposes. Since the data was collected from one Mexican TV station (Televisa) and two US-American stations (Univisión and Voice of America), it represents the North American variety of the Spanish language. The corpus is available as a web download (cf. Munoz/Alabiso/Graff 1998), but is subject to a fee.
Another small corpus of tertiary media texts compiled by Santillán (2009) contains 150 text messages (SMS), 30 e-mails and two transcripts of chats, not only in Spanish, but also in Italian and German. The SMS as well as the e-mails were classified corresponding to the age of the respective writers. Although the main intention of Santillán’s study was the comparison of the Spanish, Italian and German youth languages, her corpus is of great interest since it is published entirely both in the printed version of her dissertation and in the digital one, which can be retrieved online (cf. Santillán 2009).
An even smaller corpus of Spanish text messages was compiled by Paredes (2008) and consists of 70 SMS, which can be accessed freely via Internet.
Concerning corpora of transcribed telephone conversations, the Corpus del español conversacional de Barcelona y su área metropolitana, compiled by the group GRIESBA (Grupo de Investigación del Español de Barcelona) at the Universitat de Barcelona, includes – besides informal face-to-face conversations – transcribed discourses by telephone. A first part of the corpus was published by Vila Pujol (2001) and it seems that the digital processing of the data is also intended.
Similar to the 1997 Spanish Broadcast News Transcripts, the Spanish-SpeechDat (M) was compiled to gather phonetic and prosodic data for the development of several natural language processing applications. It is a corpus of language data collected via telephone, which – in contrast to the other corpora described in this section – do not represent natural, i.e., spontaneous speech, but consist of elicited responses (the speakers were, for example, asked to say several numbers or to spell certain words). The corpus – which can be purchased on CD-ROM (cf. ELRA 1999) – consists of 1,002 recordings and the corresponding transcriptions.
Regarding corpora of web texts, we have to make the distinction between “web for corpus” and “web as corpus”. Examples of the former are the esTenTen and the esAmTenTen corpus, which include more than 8 billion words from 19 Spanish-speaking countries (although there is no proportional relation between the included words from one country and the number of speakers residing in this country). Both corpora comprise the textual material from websites that were classified by their URL as belonging to a country whose official language is Spanish (this is the reason why the US-American variety of Spanish is not represented in the esAmTenTen corpus). Those texts are lemmatized, POS-tagged and provided with information about their geographical origin. Like most web corpora, esTenTen and esAmTenTen inevitably include texts from traditional secondary media, like newspapers that have an Internet presence; private CMC genres, such as e-mails or chat, are most likely not represented in these corpora, which can be queried after paid registration (cf. Sketch Engine 2015a and 2015b). Since both corpora belong to the TenTen family, they are designed as monitoring corpora, i.e., they are expanded and updated every year or every second year.
The Wikicorpus is a trilingual corpus (English, Spanish, and Catalan; cf. also Colominas above) that comprises over 750 million words from the 2006 version of Wikipedia, 120 million of which are in Spanish. The texts were lemmatized, POS-tagged and enriched with a semantic annotation by a group of researchers at the Universitat Politècnica de Catalunya, Barcelona, and the Universidad del País Vasco, Donostia. All three subcorpora can be downloaded freely (cf. Reese et al. n.d.).
Turning to “web as corpus”, one certainly has to mention WebCorp, developed at Birmingham City University, which has as its motto “Concordance the web in real-time”. On the website (cf. RDUES 2016), one can enter any lexical item and get a KWIC concordance for the queried word, sorted according to the individual sites on which it appears. Since it is possible to select the language and/or the country code, the results can be filtered beforehand. The full texts are usually accessible by clicking on the keyword in question.
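The principle of a KWIC (keyword-in-context) concordance can be illustrated with a minimal Python sketch. It is not WebCorp’s implementation: it works on a local string rather than on live web pages, and the function name kwic, the width parameter and the sample sentence are hypothetical choices made for illustration only.

```python
import re


def kwic(text, keyword, width=30):
    """Return one aligned concordance line per occurrence of keyword.

    Each line shows up to `width` characters of left context, the keyword
    in square brackets, and up to `width` characters of right context.
    """
    # Whole-word, case-insensitive match of the search term.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Pad both contexts so that the keywords line up in one column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


if __name__ == "__main__":
    sample = "El corpus es grande. Otro corpus es pequeño."
    for line in kwic(sample, "corpus", width=15):
        print(line)
```

Aligning the keyword in a fixed column is what makes recurrent patterns to the left and right of the search term visible at a glance; a web-based tool would additionally have to fetch, clean and group the pages by site.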
Since the corpora described in the following section of this paper are not published (yet), this overview is not meant to be exhaustive either. There are certainly many more corpora of Spanish in the tertiary media, but it is virtually impossible to find and name them all.
The Corpus Cumbre was compiled at the Universidad de Murcia in the 1990s and conceived as a monitoring corpus of the Spanish language. As the corpus was financed privately by the publishing house SGEL, it is not accessible. It represents European as well as American Spanish, although not in a balanced way: 65% of the texts come from Europe, while the remaining 35% are from Latin America. The subcorpus of the tertiary media is integrated completely in the corpus of oral speech and comprises, besides different text types of primary medial nature, transcriptions of telephone conversations and TV and radio broadcasts, all of which had to fulfill the precondition of spontaneity. The three dimensions of the diasystem were also taken into account during the compilation of the Corpus Cumbre, so that it can be used even for detailed diatopic studies: both Spain and Latin America are subdivided into – linguistically quite comprehensible – zones. For further information concerning Cumbre cf. Sánchez (1995).
The Asociación de Academias de la Lengua Española (ASALE) decided in 2007 to compile the Corpus del español del siglo XXI – CORPES XXI, which is meant to be a continuation of the CORDE and CREA corpora described above. Based on 300 million words from the years 2000 to 2011, it is designed to become a panhispanic monitoring reference corpus, which will be expanded by 25 million words per year. While in the other corpora of the Real Academia Española, the focus lay clearly on the European variety of the Spanish language, the CORPES XXI will include more texts from Latin America: the ratio of Latin-American to European texts will be 70:30. Although the provisional beta version of the corpus, which can be consulted online (cf. Real Academia Española 2016), does not include any oral texts, it is foreseeable that a good portion of the oral subcorpus will stem from the tertiary media: the typology includes the categories entrevista, reportaje and noticia (de radio o televisión), and 7.5% of the corpus of written texts will come from the Internet, though it is not clear if these texts will be predominantly texts from secondary media published on the Internet.
Another project that continues an already existing corpus is the Corpus del Español Mexicano Contemporáneo II (CEMC II) currently under construction at the Colegio de México. The first CEMC comprises the years 1921–1974; its continuation will represent the Mexican variety of the Spanish language between 1975 and 2012. While the former did not take into account any texts from the tertiary media, the latter will also include texts from the Internet, like blogs, e-mails, etc. According to the official website of the corpus (cf. Lara n.d.), it should have been presented in 2014, but nothing has been published yet.
The Corpus Difusión Internacional del Español por Radio y Televisión (DIES-RTV) is an internationally compiled corpus started in 1993 under the aegis of Raúl Ávila from the Colegio de México, who later changed the name of the corpus to DIES-M (Difusión del Español por los Medios de Comunicación Masiva) to take print media and the Internet into account. Most of the transcribed texts stem from radio and TV broadcasts and were collected in 20 Spanish-speaking countries. Initially, it was intended to include a minimum of 50,000 words from each country, but according to Vera (1997), Spain contributed about 75,000 words to the corpus. Unfortunately, it was not possible to find any information about the subcorpora of the DIES corpus.
Besides these large corpus projects that have not (yet) been published, there are several studies based on smaller corpora of tertiary media texts. The remainder of this section is dedicated to them, in alphabetical order of the authors’ last names.
Alcántara Plá (2014) analyzes a corpus of 176,000 graphic words, sent via the popular messenger application WhatsApp, with regard to the differences that this form of communication displays in comparison to both written and spoken texts.
A fairly large corpus of Spanish on the Internet was compiled by Ávila (2007), who examines the official websites of the national governments of every Spanish-speaking country, as well as of Puerto Rico and the USA, and compares them to personal blogs from these 21 countries. Each subcorpus (i.e., the official website corpus and the blog corpus) contains more than 21,000 graphic words. To prepare the texts for computerized analysis, they were edited by marking only proper names, orthographic variants, abbreviations and other graphic variants. The analysis compared the two corpora with regard to sentence length, text density and lexical characteristics.
In a short article, Blas Arroyo (2010) investigates the persona of Risto Mejide, one of the judges in the Spanish casting show “Operación Triunfo”, using a pragmalinguistic approach. Unfortunately, the author does not describe his corpus in detail, but only mentions that all of Mejide’s interventions in the shows broadcast between 2006 and 2008 entered the analysis, which aimed to determine how the (im)polite image of this person was constructed via television.
An apparently unpublished corpus of Spanish text messages (SMS) is used by Boudrique/Catapano/Pollet (2008). Unfortunately, no further information about this corpus is available, except for the mention of the corresponding Master’s assignment at the Université Paul-Valéry Montpellier 3 in Panckhurst (2010).
Kallweit (2015) analyzes the log files of several chat sessions – recorded on four days in 2009 and one day in 2010 – with specific reference to alternative spelling strategies. His corpus comprises 160,329 graphic words, which represent the spontaneous chat messages of the recorded users on the platform <www.irc-hispano.es>.
In his PhD thesis, López Martín (2011) examines impoliteness in the radio shows of one Spanish presenter, Federico Jiménez Losantos. For that purpose, 83 audio files from 52 different shows were transcribed and analyzed; the radio shows were broadcast between 2007 and 2009 on the Spanish national station Cadena COPE, and between 2009 and February 2011 on esRadio. The vast majority of the analyzed audio files consist of political commentary, while a smaller part covers other topics, such as sports, tabloid content, or informal talk. López Martín selected the radio shows at random, following no pattern.
Perelló-Oliver/Muela-Molina (2013) study the (linguistic) content of radio advertising from four Spanish national radio stations (Cadena Ser, Onda Cero, Cadena COPE and Punto Radio), recorded in June 2009. The corpus comprises 430 spots, fully transcribed.
Perona Páez (2007) also analyzes radio advertising, using a corpus of 469 spots, which were broadcast in April and October 2005 during what the author considers the prime time of radio, i.e., between 9:00 and 11:00. The recorded radio stations are the same as in Perelló-Oliver/Muela-Molina’s study, but Perona Páez does not concentrate only on the linguistic content of the spots; he also examines their potential for innovation. In his publication with Barbeito Veloso, Perona Páez (2008) uses the same corpus as for his article from 2007. Both authors also consider the music, sound effects and silences in their analysis.
As has been shown on the preceding pages, there are as yet no representative corpora of the Spanish language in the tertiary media. Most of the described corpora are subsections of larger corpora, which contain several modes of communication, i.e., are not limited to the tertiary media. The existing possibilities to search the corpora described above are still insufficient, mostly because the respective authors label their subcorpora imprecisely. Some use a technically-materially defined concept of the term medium, e.g., “the medium Internet”, while others base their labeling on the criterion of the distribution of the respective medium/text, i.e., the concept of mass media, including originally printed texts that are published online afterwards. Accessing and analyzing only the tertiary media texts in the described corpora hence poses difficulties to any linguist, since the results of several combined queries in several subcorpora have to be filtered manually, which is both time-consuming and laborious. We hope that the as yet unpublished corpora will be easier to access and will start to fill the gap identified in this overview; furthermore, it is a clear desideratum to have access also to more of the smaller corpora described in section 6.3 of this article. It is desirable that more researchers share the basis of their studies with a wider public and do not keep these corpora under lock and key.
Abecassis, Michaël (2005), The Representation of Parisian Speech in the Cinema of the 1930s, Oxford et al., Lang.
Albano Leoni, Federico (n.d.), CLIPS – Corpora e Lessici dell’Italiano Parlato e Scritto, <www.clips.unina.it> (11.01.2017).
Alcántara Plá, Manuel (2014), Las unidades discursivas en los mensajes instantáneos de wasap, Estudios de Lingüística del Español 35:1, 214–233.
Algozino, Elisa (2011), Lessico e variazione di registro: un confronto tra i corpora NUNC, LIP e Athenaeum, in: Massimo Cerruti/Elisa Corino/Cristina Onesti (edd.), Formale e informale. La variazione di registro nella comunicazione elettronica, Roma, Carocci, 183–203.
Allora, Adriano (2009), Variazione diamesica generale nelle comunicazioni mediate dalla rete, Rassegna Italiana di Linguistica applicata 3, 147–170.
Allora, Adriano (2011), Annotazioni sulla sintassi dell’italiano di registro alto nei newsgroup, in: Massimo Cerruti/Elisa Corino/Cristina Onesti (edd.), Formale e informale. La variazione di registro nella comunicazione elettronica, Roma, Carocci, 204–220.
Allora, Adriano (n.d.), SMS Monitor Studies, <www.e-allora.net/SMS/ms_index.php> (11.01.2017).
Alturo, Núria/Boix, Emili/Perea, M. Pilar (2002), Corpus de Català Contemporani de la Universitat de Barcelona (CUB), A general presentation, in: Claus Pusch/Wolfgang Raible (edd.), Romanistische Korpuslinguistik, Tübingen, Narr, 155–170.
André, Virginie (2006), Corpus avoixnue_06, <http://www.cnrtl.fr/corpus/tcof/rechercher/info.php?handle=11858/00-175C-0000-0023-E00E-F> (11.01.2017).
André, Virginie (2007), Corpus TEL_MAZ_07, <http://www.cnrtl.fr/corpus/tcof/rechercher/info.php?handle=11858/00-175C-0000-0001-44FC-4> (11.01.2017).
André, Virginie (n.d.), Corpus telephone_lam_13, <http://www.cnrtl.fr/corpus/tcof/rechercher/info.php?handle=11858/00-175C-0000-0023-DF86-8> (11.01.2017).
Antoniadis, Georges (2014), Corpus de SMS réels dans les Alpes smsalpes, in: Thierry Chanier (ed.), Banque de corpus CoMeRe, Nancy, Ortolang, <http://hdl.handle.net/11403/comere/cmr-smsalpes/cmr-smsalpes-tei-v1> (11.01.2017).
Atifi, Hassan (2007), Continuité et/ou rupture dans l’Internet multilingue: quelle langue parler dans un forum diasporique?, Glottopol 10, 113–126.
Avanzi, Mathieu, et al. (2010), C-PROM: An annotated corpus for French prominence studies, in: Proceedings of Prosodic Prominence: Perceptual and Automatic Identification, Proceedings of Speech Prosody 2010 Satellite Workshop (Chicago, Illinois, USA, May 10, 2010), <http://speechprosody2010.illinois.edu/papers/102005.pdf> (11.01.2017).
Ávila, Raúl (2007), La lengua española en el ciberespacio: páginas oficiales y personales, <http://congresosdelalengua.es/cartagena/ponencias/seccion_2/25/avila_raul.htm> (16.07.2017).
Bacelar do Nascimento, Maria Fernanda, et al. (2005), The Portuguese Corpus, in: Emanuela Cresti/Massimo Moneglia (edd.), C-ORAL-ROM. Integrated Reference Corpora for Spoken Romance Languages, Amsterdam/Philadelphia, Benjamins, 163–207.
Bachmann, Iris (2010), Planeta Brasil: Language practices and the construction of space on Brazilian TV abroad, in: Sally Johnson/Tommaso Milani (edd.), Language Ideologies and Media Discourses. Text, Practices, Politics, London/New York, Continuum, 81–100.
Bachmann, Iris (2011a), “A Gente é Latino”: the Making of New Cultural Spaces in Brazilian Diaspora Television, in: Nuria Lorenzo-Dus (ed.), Spanish at Work: Analysing Institutional Discourse across the Spanish-Speaking World, London, Palgrave Macmillan, 50–66.
Bachmann, Iris (2011b), Norm and Variation on Brazilian TV Evening News Programmes: The Case of Third-person Direct-object Anaphoric Reference, Bulletin of Hispanic Studies 88:1, 1–22.
Barbera, Manuel (2013a), Molti occhi sono meglio di uno: saggi di linguistica generale 2008–12, Milano, Qu.A.S.A.R.
Barbera, Manuel (2013b), Linguistica dei corpora e linguistica dei corpora italiana. Un’introduzione, Milano, Qu.A.S.A.R.
Barbera, Manuel (2013c), Linguistica dei corpora, in: Gabriele Iannaccaro (ed.), La linguistica italiana all’alba del terzo millennio (1997–2010), Roma, Bulzoni, 581–598.
Barbera, Manuel (n.d.), NUNC – A Multilanguage Suite of Newsgroups Corpora, <http://www.bmanuel.org/projects/ng-HOME.html> (11.01.2017).
Barme, Stefan (2002), Corpus des phonisch-nähesprachlichen Brasilianisch und europäischen Portugiesisch, Germersheim, Johannes Gutenberg-Universität Mainz, Centro de Estudios Latino-americanos/Institut für Romanistik.
Baroni, Marco, et al. (2009), The WaCky Wide Web: a collection of very large linguistically processed web-crawled corpora, Language Resources and Evaluation 43, 209–226.
Baroni, Marco, et al. (n.d.a), ItWaC – Italian WaCky (Web as Corpus), <wacky.sslmit.unibo.it> (11.01.2017).
Baroni, Marco, et al. (n.d.b), PAISÀ – Piattaforma per l’Apprendimento dell’Italiano Su corpora Annotati, <www.corpusitaliano.it> (11.01.2017).
Basile, Valerio (n.d.), TWITA, <http://valeriobasile.github.io/twita/downloads.html> (11.01.2017).
Basile, Valerio/Nissim, Malvina (2013), Sentiment analysis on Italian tweets, in: Alexandra Balahur/Erik van der Goot/Andrés Montoyo (edd.), Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA 2013), Atlanta (GA), 14 June 2013, Atlanta, Association for Computational Linguistics, 100–107.
Bedijs, Kristina (2012), Die inszenierte Jugendsprache. Von “Ciao, amigo!” bis “Wesh, tranquille!”: Entwicklungen der französischen Jugendsprache in Spielfilmen (1958–2005), München, Meidenbauer.
Beißwenger, Michael (2008), Corpora zur computervermittelten (internetbasierten) Kommunikation, in: Anke Lüdeling/Merja Kytö (edd.), Corpus Linguistics. An International Handbook, vol. 1, Berlin/New York, de Gruyter, 292–308.
Bellini, Daniele/Schneider, Stefan (2006), Spoken Italian online: the Banca dati dell’italiano parlato (BADIP), in: Bernhard Kettemann/Georg Marko (edd.), Planing, Gluing and Painting Corpora. Inside the Applied Corpus Linguist’s Workshop, Frankfurt am Main et al., Lang, 13–26.
Blas Arroyo, José Luis (2010), La descortesía en contextos de telerrealidad mediática. Análisis de un corpus español, in: Francesca Orletti/Laura Mariottini (edd.), (Des)cortesía en español. Espacios teóricos y metodológicos para su estudio, Roma/Stockholm, Università degli Studi Roma Tre/EDICE, 183–207.
Boleda, Gemma, et al. (2006), CUCWeb: a Catalan corpus built from the Web, in: Proceedings of the 2nd International Workshop on Web as Corpus, April 2006, Trento (Italy), Stroudsburg PA, Association for Computational Linguistics, 19–26.
Borghetti, Claudia/Castagnoli, Sara/Brunello, Marco (2011), I testi del web: una proposta di classificazione sulla base del corpus PAISÀ, in: Massimo Cerruti/Elisa Corino/Cristina Onesti (edd.), Formale e informale. La variazione di registro nella comunicazione elettronica, Roma, Carocci, 147–170.
Bosco, Cristina/Patti, Viviana/Bolioli, Andrea (2013), Developing corpora for sentiment analysis: the case of irony and Senti-TUT, IEEE Intelligent Systems 28, 55–63.
Boudrique, Dorian/Catapano, Hélène/Pollet, Sonia (2008), SMS espagnols & typologie des SMS français, Master’s Assignment, Université Paul-Valéry Montpellier 3.
Branco, António, et al. (2011), The Portuguese Language in the Digital Age / A língua portuguesa na era digital, <http://www.meta-net.eu/whitepapers/e-book/portuguese.pdf> (11.01.2017).
Cabré, M. Teresa/Bach, Carme (2004), El corpus tècnic del IULA: corpus textual especializado plurilingüe, Panace@ – Boletín de Medicina y Traducción 5:16, 173–176.
Calabrese, Laura (ed.) (2011), L’internet, corpus sauvage. Nouvelles ressources, nouveaux problèmes?, Le discours et la langue 2:1, Special issue.
CanleTV, o audiovisual galego (n.d.), <http://canletv.agadic.info> (11.01.2017).
Càtedra Alcover-Moll-Villangómez (2003), Arxiu audiovisual dels dialectes catalans de les Illes Balears, <http://www.uib.cat/catedra/camv/camv/audiovisual.htm> (11.01.2017).
Cignetti, Luca (2011), Note sull’impiego dei segni di interpunzione nella comunicazione mediata dal computer. Forme e funzioni del segno di virgoletta nel corpus NUNC, in: Massimo Cerruti/Elisa Corino/Cristina Onesti (edd.), Formale e informale. La variazione di registro nella comunicazione elettronica, Roma, Carocci, 171–182.
Cinémathèque Française (2003), <http://www.cinematheque.fr> (11.01.2017).
(CLiC) (n.d.a) = Centre de Llenguatge i Computació, AnCora – Corpus del català i de l’espanyol, <http://clic.ub.edu/corpus/en/ancora-search> (11.01.2017).
(CLiC) (n.d.b) = Centre de Llenguatge i Computació, CoMIT – Corpus Multimodal De Informativos Televisados, <http://clic.ub.edu/corpus/es/comit> (11.01.2017).
Compagnone, Maria Rosaria (2011), Verba volant, scripta etiam (Le parole volano, e anche le cose scritte): Comunicazione “schermo a schermo”: uno scritto che cerca di avvicinarsi all’orale, Doctoral thesis, Université Paris Ouest Nanterre La Défense.
Consello da Cultura Galega (n.d.a), AVG – O soportal do audiovisual galego, <http://www.culturagalega.org/avg> (11.01.2017).
Consello da Cultura Galega (n.d.b), Mediateca, <http://www.consellodacultura.org/mediateca> (11.01.2017).
Corbera Pou, Jaume (2004), L’“Arxiu audiovisual dels dialectes catalans de les Illes Balears”, in: Maria Pilar Perea (ed.), Dialectologia i recursos informàtics, Barcelona, PPU, 117–134.
Cougnon, Louise-Amélie/Fairon, Cédrick (edd.) (2012), SMS Communication: A Linguistic Approach, Linguisticae Investigationes 35:2, Special issue.
Cresti, Emanuela (2000), Corpus di italiano parlato, Firenze, Accademia della Crusca.
Cresti, Emanuela/Moneglia, Massimo (edd.) (2005), C-ORAL-ROM. Integrated Reference Corpora for Spoken Romance Languages, Amsterdam/Philadelphia, Benjamins.
Cresti, Emanuela/Panunzi, Alessandro (2013), Introduzione ai corpora dell’italiano, Bologna, Il Mulino.
(CVC) (n.d.) = Centro Virtual Cervantes, Banco de neologismos, <http://cvc.cervantes.es/lengua/banco_neologismos/busqueda.asp> (11.01.2017).
Davies, Mark (2016), El Corpus del Español, <http://www.corpusdelespanol.org> (11.01.2017).
De Mauro, Tullio, et al. (1993), Lessico di frequenza dell’italiano parlato, Milan, Etas Libri.
De Yzaguirre, Lluís/Farriols, Antoni J./Martí, Jaume (2007), El corpus RETOC: un corpus oral per a la recerca i la docència, in: Sadurní Martí et al., Actes del Tretzè Coŀloqui Internacional de Llengua i Literatura Catalanes, Universitat de Girona, 8–13 de setembre de 2003, vol. II, Barcelona, Publicacions de l’Abadia de Montserrat, 495–504.
Domínguez Noya, Eva (2008), O “Corpus de Referencia do Galego Actual (CORGA)”: presente e futuro, in: Ernesto González Seoane/Antón Santamarina/Xavier Varela Barreiro (edd.), A lexicografía galega moderna: Recursos e perspectivas, Santiago de Compostela, Consello da Cultura Galega/Instituto da Lingua Galega, 139–151.
(ELRA) (1999) = European Language Resources Association, Spanish SpeechDat(M), <http://catalog.elra.info/product_info.php?products_id=721> (11.01.2017).
(ELRA) (2014a) = European Language Resources Association, EPAC Corpus: orthographic transcriptions, <http://catalog.elra.info/product_info.php?products_id=1119> (11.01.2017).
(ELRA) (2014b) = European Language Resources Association, ESTER Corpus – Evaluation of Broadcast News enriched transcription systems, <http://catalog.elra.info/product_info.php?products_id=999> (11.01.2017).
(ELRA) (2014c) = European Language Resources Association, ESTER 2 Corpus – Evaluation of Broadcast News enriched transcription systems 2, <http://catalog.elra.info/product_info.php?products_id=1167> (11.01.2017).
Falaise, Achille (2005), Constitution d’un corpus de français tchaté, in: Nicolas Hernandez/Guillaume Pitel (edd.), Actes de la 7e Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL'2005), Dourdan (France), <http://www.atala.org/taln_archives/RECITAL/RECITAL-2005/recital-2005-long-010.pdf> (11.01.2017).
Falaise, Achille (2014), Corpus de français tchaté getalp_org, in: Thierry Chanier (ed.), Banque de corpus CoMeRe, Nancy, Ortolang, <https://repository.ortolang.fr/api/content/comere/v1/cmr-getalp_org> (11.01.2017).
Falbo, Caterina (2012), CorIT (Italian Television Interpreting Corpus): classification criteria, in: Francesco Straniero Sergio/Caterina Falbo (edd.), Breaking Ground in Corpus-based Interpreting Studies, Bern et al., Lang, 155–185.
Forsgren, Mats, et al. (n.d.), Spoken French in the media (Le français parlé des médias), <http://www.klassiska.su.se/english/research/research-projects/spoken-french-in-the-media-le-français-parlé-des-médias-1.15464> (11.01.2017).
frWaC (n.d.), French WaCky (Web as Corpus), <http://nl.ijs.si/noske/all.cgi/corp_info?corpname=frwac> (11.01.2017).
Garrido, Juan María, et al. (2013), Glissando: a corpus for multidisciplinary prosodic studies in Spanish and Catalan, Language Resources and Evaluation 47:4, 945–971.
Giannini, Antonella/Pettorino, Massimo/Vitagliano, Ilaria (n.d.a), Corpus di parlato cinematografico, <http://www.parlaritaliano.it/index.php/it/dati/659-corpus-di-parlato-cinematografico> (11.01.2017).
Giannini, Antonella/Pettorino, Massimo/Vitagliano, Ilaria (n.d.b), Corpus di parlato telegiornalistico. Anni Sessanta vs. 2005, <http://www.parlaritaliano.it/index.php/it/dati/650-corpus-di-parlato-telegiornalistico-anni-sessanta-vs-2005> (11.01.2017).
Giraudel, Aude, et al. (2012), The REPERE Corpus: a multimodal corpus for person recognition, in: Nicoletta Calzolari et al. (edd.), Proceedings of LREC-2012, <http://www.lrec-conf.org/proceedings/lrec2012/pdf/707_Paper.pdf> (18.03.2016).
Gómez Guinovart, Xavier/Sacau Fontenla, Elena (2007), Técnicas de procesamento lingüístico-computacional de corpus paralelos no CLUVI (Corpus Lingüístico da Universidade de Vigo), in: Pablo Cano López (ed.), Actas del VI Congreso de Lingüística General, Santiago de Compostela, 3–7 de mayo de 2004, Madrid, Arco Libros, 855–864.
Gravier, Guillaume, et al. (2012), The ETAPE corpus for the evaluation of speech-based TV content processing in the French language, in: Nicoletta Calzolari et al. (edd.), Proceedings of LREC-2012, <http://www.lrec-conf.org/proceedings/lrec2012/pdf/495_Paper.pdf> (11.01.2017).
Gutierrez Gonzalez, Zeli Miranda (2007), Lingüística de Corpus na análise do Internetês, São Paulo, Pontifícia Universidade Católica de São Paulo.
GzVideos (n.d.), GzVideos – GalizaVideos, <http://www.gzvideos.info> (11.01.2017).
(IEC) (n.d.) = Institut d’Estudis Catalans, CTILC – Corpus textual informatitzat de la llengua catalana, <http://ctilc.iec.cat> (11.01.2017).
Il portale dell’italiano televisivo (2013), <http://www.italianotelevisivo.org> (11.01.2017).
ILSE (2005), Corpus del Habla en Almería, <http://nevada.ual.es/otri/ilse/corpus.asp> (11.01.2017).
(ILTEC) (n.d.) = Instituto de Linguística Teórica e Computacional, Corpus REDIP (Rede de Difusão Internacional do Português: rádio, televisão e imprensa), <http://www.iltec.pt/?action=concord&lang=p> (11.01.2017).
(INESC-ID) (n.d.a) = Instituto de Engenharia de Sistemas e Computadores – Investigação e Desenvolvimento, SPEECHDAT corpus for European Portuguese, <https://www.l2f.inesc-id.pt/wiki/index.php/SPEECHDAT_Corpus> (11.01.2017).
(INESC-ID) (n.d.b) = Instituto de Engenharia de Sistemas e Computadores – Investigação e Desenvolvimento, SPEECHDAT Example Prompt Sheet, <https://www.l2f.inesc-id.pt/wiki/index.php/SPEECHDAT_Example_Prompt_Sheet> (11.01.2017).
Institut National de l’Audiovisuel (n.d.), <www.ina.fr> (11.01.2017).
Isosävi, Johanna (2010), Les formes d’adresse dans un corpus de films français et leur traduction en finnois, Mémoires de la Société Néophilologique de Helsinki, Helsinki, Société Néophilologique, <https://helda.helsinki.fi/handle/10138/19243> (11.01.2017).
(IULA) (n.d.a) = Institut Universitari de Lingüística Aplicada, BOBNEO – Banc de dades de l’Observatori de Neologia, <http://bwananet.iula.upf.edu> (11.01.2017).
(IULA) (n.d.b) = Institut Universitari de Lingüística Aplicada, Bwananet. Programa de explotació del Corpus Tècnic de l’IULA, <http://bwananet.iula.upf.edu> (11.01.2017).
Kallweit, Daniel (2015), Neografie in der computervermittelten Kommunikation des Spanischen. Zu alternativen Schreibweisen im Chatnetzwerk www.irc-hispano.es, Tübingen, Narr.
Klein, Jean René/Paumier, Sébastien/Fairon, Cédrick (2006), SMS pour la science. Corpus de 30.000 SMS et logiciel de consultation, Louvain-la-Neuve, Presses universitaires de Louvain.
Labeau, Emmanuelle (2007), De l’objectif au subjectif: le rapport du discours à la télévision, in: Mathias Broth et al. (edd.), Le français parlé des médias, Stockholm, Acta Universitatis Stockholmiensis, 365–382.
Laboratoire ICAR (2004), CLAPI – Corpus de langues parlées en interaction. Conversations téléphoniques – conversations entre amis – call friends – Français quebec – appel 5136, <http://clapi.univ-lyon2.fr/V3_Feuilleter.php?num_corpus=58> (11.01.2017).
Laboratoire ICAR (2007/2008), CLAPI – Corpus de langues parlées en interaction. Conversations en ligne, 8_Samira-Isabelle 2, <http://clapi.ish-lyon.cnrs.fr/V3_Feuilleter.php?chronoFeuille=f76a2c0dc9> (11.01.2017).
Laboratorio de Lingüística Informática (n.d.), CORLEC – Corpus Oral de Referencia de la Lengua Española Contemporánea, <http://www.lllf.uam.es/ESP/Corlec.html> (11.01.2017).
Lara, Luis Fernando (n.d.), CEMC – Corpus del Español Mexicano Contemporáneo, <http://www.corpus.unam.mx:8080/cemc> (11.01.2017).
Ledegen, Gudrun (2014), Grand corpus de sms smslareunion, in: Thierry Chanier (ed.), Banque de corpus CoMeRe, Nancy, Ortolang, <http://hdl.handle.net/11403/comere/cmr-smslareunion/cmr-smslareunion-tei-v1> (11.01.2017).
Lemnitzer, Lothar/Zinsmeister, Heike (2006), Korpuslinguistik. Eine Einführung, Tübingen, Narr.
Lindqvist, Christina (2001), Corpus transcrit de quelques journaux télévisés français, Uppsala, Uppsala Universitet.
Ljubešić, Nikola/Toral, Antonio (2014), caWaC – A web corpus of Catalan and its application to language modeling and machine translation, in: Nicoletta Calzolari et al. (edd.), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Paris, European Language Resources Association (ELRA), 1728–1732.
Longhi, Julien, et al. (2014), Polititweets: corpus de tweets provenant de comptes politiques influents 1, in: Thierry Chanier (ed.), Banque de corpus CoMeRe, Nancy, Ortolang, <https://repository.ortolang.fr/api/content/comere/v1/cmr-polititweets.html> (11.01.2017).
López Martín, José M. (2011), La descortesía en el lenguaje radiofónico. El discurso de Federico Jiménez Losantos, PhD thesis, Universidad de Sevilla, <fondosdigitales.us.es/tesis/tesis/2087/la-des-cortesia-en-el-lenguaje-radiofonico> (11.01.2017).
Lorenz, Paulina/Michot, Nicolas (2012), Le lexique du chat sur Internet: étude comparative français–espagnol–polonais, SHS Web of Conferences 1, 939–954.
Lorenz, Paulina/Michot, Nicolas (2014), Les lexiques des jeunes dans les discours écrits des blogs: pour une approche descriptive, SHS Web of Conferences 8, 801–811.
Maraschio, Nicoletta/Stefanelli, Stefania (edd.) (2003), LIR – Lessico italiano radiofonico (1995–2003), Firenze, Accademia della Crusca.
Marcoccia, Michel/Gauducheau, Nadia (2007), L’analyse du rôle des smileys en production et en réception: un retour sur la question de l’oralité des écrits numériques, Glottopol 10, 39–55.
Martins, Simone (2008), A Construção da Identidade das Telenovelas Brasileiras: O Processo de Identificação dos Telespectadores com a Narrativa Ficcional Televisiva, <http://www.ufrgs.br/alcar/encontros-nacionais-1/encontros-nacionais/6o-encontro-2008-1/A%20Construcao%20da%20Identidade%20das%20Telenovelas%20Brasileiras.pdf> (11.01.2017).
Meinedo, Hugo/Souto, Nuno/Neto, João P. (2001), Speech Recognition of Broadcast News for the European Portuguese Language, <http://www.inesc-id.pt/pt/indicadores/Ficheiros/730.pdf> (11.01.2017).
(MICC) (n.d.a) = Media Integration and Communication Center, DIA-LIT – Lessico dell’italiano televisivo, <http://193.205.158.203/dialit> (11.01.2017).
(MICC) (n.d.b) = Media Integration and Communication Center, LIR – Lessico dell’italiano radiofonico, <http://193.205.158.203/Lir> (11.01.2017).
(MICC) (n.d.c) = Media Integration and Communication Center, LIT – Lessico dell’italiano televisivo, <http://193.205.158.203/lit_ric2> (11.01.2017).
Motter, Maria Lourdes/Jakubaszko, Daniela (2005), Telenovela e realidade social: algumas possibilidades dialógicas, <http://www.intercom.org.br/papers/nacionais/2005/resumos/R0994-1.pdf> (11.01.2017).
Mourlhon-Dallies, Florence (2007), Communication électronique et genres du discours, Glottopol 10, 11–23.
Munoz, Elisa/Alabiso, Jennifer/Graff, David (1998), 1997 Spanish Broadcast News Transcripts (HUB4-NE) LDC98T29, <https://catalog.ldc.upenn.edu/LDC98T29> (11.01.2017).
Panckhurst, Rachel (2010), Texting in three European languages: does the linguistic typology differ?, in: Proceedings of the First International Conference on Meaning in Interaction, University of the West of England, Bristol, 23–25 April 2009, <http://www.univ-montp3.fr/sl/rachel/M1/Bristol_RP.pdf> (11.01.2017).
Panckhurst, Rachel, et al. (2013), Sud4science, de l’acquisition d’un grand corpus de SMS en français à l’analyse de l’écriture SMS, Épistémè – revue internationale de sciences sociales appliquées 9, 107–138.
Panckhurst, Rachel, et al. (2014), 88milSMS. A corpus of authentic text messages in French, <http://88milsms.huma-num.fr> (11.01.2017).
Paredes, Coralie (2008), La escritura SMS: una forma “rebelde” de adaptación a las [sic] nuevos medios de comunicación, <http://cle.ens-lyon.fr/espagnol/la-escritura-sms-una-forma-rebelde-de-adaptacion-a-las-nuevos-medios-de-comunicacion-49982.kjsp> (11.01.2017).
Parodi, Giovanni (n.d.), El Grial – Interfaz de etiquetaje e interrogación de corpus textuales, <http://www.elgrial.cl> (11.01.2017).
Patti, Viviana, et al. (2013), Corpus SentiTUT, <www.di.unito.it/~tutreeb/sentiTUT.html> (11.01.2017).
Perelló-Oliver, Salvador/Muela-Molina, Clara (2013), Análisis de contenido de la publicidad radiofónica en España, methaodos.revista de ciencias sociales 1, 33–52.
Perona Páez, Juan José (2007), Formatos y estilos publicitarios en el prime-time radiofónico español: infrautilización y sequía de ideas, ZER – Revista de estudios de comunicación 23, 219–242.
Perona Páez, Juan José/Barbeito Veloso, Mariluz (2008), El lenguaje radiofónico en la publicidad del prime time generalista. Los anuncios en la “radio de las estrellas”, Telos 77, 115–124.
Pons i Griera, Lídia, et al. (n.d.), LipTV: Llengua i Publicitat a la Televisió. Presentació, <http://www.lipgrup.cat/forms/presentacio.php?m=17&t=0&id=17> (11.01.2017).
Rafel, Joaquim (1994), Un corpus general de referència de la llengua catalana, Caplletra 17, 219–250.
Ramilo, Maria Celeste/Freitas, Tiago (2002), A linguística e a linguagem dos média em Portugal: descrição do projeto REDIP, <http://www.iltec.pt/pdf/wpapers/2002-redip-redip.pdf> (11.01.2017).
(RDUES) (2016) = Research and Development Unit for English Studies, WebCorp Live – Concordance the web in real-time, <http://www.webcorp.org.uk/live> (11.01.2017).
Real Academia Española (2016), CORPES – Corpus del Español del Siglo XXI, versión provisional 0.83 (1 de junio de 2016), <http://web.frl.es/CORPES/view/inicioExterno.view> (11.01.2017).
Real Academia Española (2017), Corpus orales incorporados a CREA, <http://www.rae.es/publicaciones/corpus-orales-incorporados-crea> (11.01.2017).
Real Academia Española (n.d.), CREA – Corpus de Referencia del Español Actual, <http://corpus.rae.es/creanet.html> (11.01.2017).
Reese, Samuel, et al. (2010), Wikicorpus: A word-sense disambiguated multilingual Wikipedia corpus, in: Proceedings of LREC'2010, Valletta (Malta), 1418–1421.
Reese, Samuel, et al. (n.d.), Wikicorpus, v. 1.0: Catalan, Spanish and English portions of the Wikipedia, <http://www.cs.upc.edu/~nlp/wikicorpus> (11.01.2017).
Rigoll, Gerhard (2002), Content filtering and retrieval in multimedia documents with the ALERT system, in: Hans-Jörg Bullinger/Anette Weisbecker (edd.), Content Management – digitale Inhalte als Bausteine einer vernetzten Welt, Stuttgart, Fraunhofer Gesellschaft, 61–66.
Rojo, Guillermo, et al. (n.d.), CORGA – Corpus de Referencia do Galego Actual, <http://corpus.cirp.es/corga> (11.01.2017).
Sánchez, Aquilino (1995), Cumbre – Corpus Lingüístico del Español Contemporáneo. Fundamentos, Metodología y Aplicaciones, Madrid, SGEL.
Sandré, Marion (2010), Débat politique télévisé et stratégies discursives: la visée polémique des ratés du système des tours, in: Marcel Burger/Jérôme Jacquin/Raphaël Micheli (edd.), Les médias et le politique. Actes du colloque “Le français parlé dans les médias”, Lausanne, 1–4 septembre 2009, Lausanne, Centre de linguistique et des sciences du langage, <https://www.unil.ch/clsl/fr/home/menuinst/publications/actes-fpm-2009.html> (11.01.2017).
Santillán, Elena (2009), Digitale Jugendkommunikation in der Informationsgesellschaft. Spanisch, Italienisch und Deutsch im Vergleich, Wien, Praesens. Online version: <http://othes.univie.ac.at/5362/1/2009-05-30_0106324.pdf> (11.01.2017).
Schlobinski, Peter/Siever, Torsten (edd.) (2005), Sprachliche und textuelle Merkmale in Weblogs. Ein internationales Projekt. Networx 46, <http://www.mediensprache.net/de/networx/docs/networx-46.aspx> (11.01.2017).
Schneider, Stefan/Bellini, Daniele (n.d.), BADIP – BAnca Dati dell’Italiano Parlato, <badip.uni-graz.at> (11.01.2017).
Sciubba, Maria Eleonora (2010), Salutations, openings and closings in today academic emails, Studi Italiani di Linguistica Teorica e Applicata 39:2, 243–264.
Sieberg, Bernd (2005), Sprachliche und textuelle Aspekte in portugiesischen Weblogs, in: Peter Schlobinski/Torsten Siever (edd.), Sprachliche und textuelle Merkmale in Weblogs. Ein internationales Projekt. Networx 46, <http://www.mediensprache.net/de/networx/docs/networx-46.aspx> (11.01.2017), 198–224.
Simon, Anne Catherine, et al. (2010), C-PROM corpus libre de parole multigenre, <http://sites.google.com/site/corpusprom> (11.01.2017).
Sketch Engine (2015a), Corpus esAmTenTen, <https://www.sketchengine.co.uk/esAmtenten-corpus> (11.01.2017).
Sketch Engine (2015b), Corpus esTenTen, <https://www.sketchengine.co.uk/estenten-corpus> (11.01.2017).
(SLI) (2003) = Seminario de Lingüística Informática da Universidade de Vigo, CLUVI: Corpus Lingüístico da Universidade de Vigo, <http://sli.uvigo.es/CLUVI> (11.01.2017).
Sotelo Dios, Patricia (2011), Corpus multimedia VEIGA inglés-galego de subtitulación cinematográfica, Linguamática 3:2, 99–106.
Sotelo Dios, Patricia (n.d.), Corpus VEIGA, <http://sli.uvigo.es/CLUVI/vmm.html> (11.01.2017).
Sotelo Dios, Patricia/Gómez Guinovart, Xavier (2012), A Multimedia Parallel Corpus of English-Galician Film Subtitling, in: Alberto Simões/Ricardo Queirós/Daniela da Cruz (edd.), 1st Symposium on Languages, Applications and Technologies, OASIcs: Open Access Series in Informatics 21, Saarbrücken, Dagstuhl Publishing, 255–266.
Spina, Stefania (2015), Perugia Corpus, <https://www.unistrapg.it/cqpweb> (11.01.2017).
Stark, Elisabeth/Dürscheid, Christa/Meisner, Charlotte (2014), Linguists Research WhatsApp Communication: First Results, <http://www.whatsup-switzerland.ch/system/media/Whats-up-Switzerland-First-Results.pdf> (11.01.2017).
Stark, Elisabeth/Ueberwasser, Simone/Ruef, Beni (2009–2014), Swiss SMS Corpus, <https://sms.linguistik.uzh.ch> (11.01.2017).
Suchomel, Vít/Pomikálek, Jan (2012), Efficient web crawling for large text corpora, in: Serge Sharoff/Adam Kilgarriff (edd.), Proceedings of the seventh Web as Corpus Workshop (WAC7), 39–43, <https://sigwac.org.uk/raw-attachment/wiki/WAC7/wac7-proc.pdf> (19.03.2017).
Sullet-Nylander, Françoise/Roitman, Malin (2010), De la confrontation politico-journalistique dans les grands duels politiques télévisés: questions et préconstruits, in: Marcel Burger/Jérôme Jacquin/Raphaël Micheli (edd.), Les médias et le politique. Actes du colloque “Le français parlé dans les médias”, Lausanne, 1–4 septembre 2009, Lausanne, Centre de linguistique et des sciences du langage, <https://www.unil.ch/clsl/fr/home/menuinst/publications/actes-fpm-2009.html> (11.01.2017).
Taulé, Mariona/Martí, M. Antònia/Recasens, Marta (2008), AnCora: Multilevel Annotated Corpora for Catalan and Spanish, in: Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC'2008), Marrakesh (Morocco), Valletta, 96–101, <http://www.lrec-conf.org/proceedings/lrec2008/pdf/35_paper.pdf> (19.03.2017).
Universiteit Gent (n.d.), Corpus Finder: ASILA, <http://www.corpusfinder.ugent.be/corpus/84> (11.01.2017).
Van Compernolle, Remi Adam/Williams, Lawrence (2007), De l’oral à l’électronique: la variation orthographique comme ressource sociostylistique et pragmatique dans le français électronique, Glottopol 10, 56–69.
Vera, Agustín (1997), Proyecto Fénix: los medios de comunicación como recurso lingüístico, <http://congresosdelalengua.es/zacatecas/ponencias/tecnologias/proyectos/vera.htm> (19.03.2017).
Vila Pujol, María Rosa (2001), Corpus del español conversacional de Barcelona y su área metropolitana, Barcelona, Edicions Universitat de Barcelona.
Vila Rigat, Marta, et al. (2010), ClInt: a Bilingual Spanish-Catalan Spoken Corpus of Clinical Interviews, Procesamiento del Lenguaje Natural 45, 105–111.
Voghera, Miriam, et al. (n.d.), Corpus VoLIP: Voce del LIP, <http://www.parlaritaliano.it/index.php/it/volip> (11.01.2017).
Wenz, Kathrin (2017), Bloguer sa vie – Französische Weblogs im Spannungsfeld zwischen Individualität und Gruppenzugehörigkeit, Frankfurt am Main et al., Lang.
Xunta de Galicia (n.d.), Institucións – Consello da Cultura Galega, <http://www.xunta.es/conselloda-cultura-galega> (11.01.2017).