IDEX (JEDI) - Vidéo Ethnotextes et Ressources Associées (VERA)


Over the last decade, BCL’s dialectology research team has carried out more than one hundred linguistic field surveys in 75 locations (covering southern France and the Occitan valleys of Piedmont), interviewing more than 120 informants and recording more than 60 hours of video, mostly in Occitan dialects but sometimes also in Ligurian, for a total of 58 ethnotexts collected. These dialectal data are of major scientific and heritage interest: they are unpublished, precisely geolocated, and collected from the last native speakers. They are clearly positioned in the field of orality and linguistic diversity, in contrast to the "standardized" or "normative" Occitan found in many of the videos already available on the web.

For each recording, a substantial set of associated resources will be provided (distributed under the LGPL-LR license): metadata (in qualified Dublin Core and OLAC formats) with GPS coordinates, allowing cartographic processing of the data; transcriptions in IPA as well as in written form using local spelling conventions (in TEI-XML P5 format), making it possible to search for a particular term in the content of the videos; a translation into French (and possibly English), giving broader access to this content without requiring prior knowledge of a given dialect; accompanying texts (written by historians) that place the discourse in its historical context; sentence-by-sentence timing segmentation (in XML format), enabling a number of applications (clicking on a sentence to listen to that specific passage in the video, quick comparison of the pronunciation of a given sentence across several dialects, dynamic subtitling in WebVTT format, etc.); and linguistic annotations (lemmatization and part-of-speech tagging), allowing linguists to study dialectal microvariation.
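
As an illustration of the dynamic subtitling mentioned above, the following minimal sketch converts sentence-level timings into WebVTT cues. The XML layout (a `segments` root with `s` elements carrying `start` and `end` attributes in seconds) and the function names are assumptions made for this example; they are not the project's actual schema or tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical timing file: the element and attribute names below are
# assumptions for illustration, not the project's actual XML schema.
SAMPLE_XML = """<segments>
  <s start="0.0" end="2.5">First sentence.</s>
  <s start="2.5" end="3661.042">Second sentence.</s>
</segments>"""


def to_timestamp(seconds: float) -> str:
    """Format a duration in seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def timings_to_webvtt(xml_text: str) -> str:
    """Build a WebVTT document with one cue per timed sentence."""
    root = ET.fromstring(xml_text)
    lines = ["WEBVTT", ""]  # mandatory header followed by a blank line
    for seg in root.iter("s"):
        start = to_timestamp(float(seg.get("start")))
        end = to_timestamp(float(seg.get("end")))
        # Each cue: timing line, cue text, then a blank separator line.
        lines += [f"{start} --> {end}", (seg.text or "").strip(), ""]
    return "\n".join(lines)


if __name__ == "__main__":
    print(timings_to_webvtt(SAMPLE_XML))
```

The same per-sentence start and end times could also drive the click-to-listen feature, by seeking the video player to the start time of the selected sentence.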

All these videos and associated resources will be freely available on the web and accessible through four channels: long-term archiving of all data on CoCoON; development of a participatory web platform (to offer Internet users a more ergonomic and efficient interface for consulting the data and processing them cartographically); implementation of a REST API (allowing direct queries to the database and fostering interoperability with specialized search engines such as Edisyn or dicod’Òc / vèrb’Òc); and upload of the ethnotexts to the THESOC YouTube™ channel (recordings distributed under the CC-BY-NC-ND license).

In line with the Open Data movement, this project aims to open up these data not only to the entire scientific community, but also to teachers and learners of regional languages and, by extension, to the general public, with many possible applications: natural language processing (artificial intelligence), linguistics (lexicon, phonetics, syntax), non-verbal communication, history, sociology, ethnology…