THE COMPUTATIONAL LINGUIST

Alongside photography, I’ve always been passionate about my work as a research scientist in Computational Linguistics, the field that teaches computers to process and understand human language. I found the ideal environment for this work at the European Commission’s Joint Research Centre (JRC), located near Lago Maggiore in Northern Italy. I consider myself fortunate to have been active from the pioneering years of the field right through to the era of widely used language technology applications.

This section of my website is dedicated to that important chapter of my life. It explains what computational linguistics is and what motivated my work, and it highlights key projects, experiences, and resources I helped create.

What is Computational Linguistics?

Put simply, Computational Linguistics (CL) focuses on enabling computers to handle human language. It powers software that helps people retrieve information efficiently, access content in multiple languages, and communicate across linguistic boundaries. Typical applications include:

Machine Translation
Document Categorisation
Information Extraction
Multi-document Summarisation
Sentiment Analysis.

CL overlaps closely with fields such as Text Mining, Natural Language Processing (NLP), and Language Engineering. The methods used range from rule-based (symbolic) approaches to Machine Learning and Artificial Intelligence.

My Specialisations

My main areas of expertise within CL include:

Multilinguality and cross-lingual information access
Developing text mining tools for many languages with limited human effort
Fusing and linking information across languages
Giving users access to foreign-language information.

Career Highlights

In 1998, I joined the Joint Research Centre (JRC) of the European Commission, where I served as a senior scientist at the Competence Centre for Text Mining and Analysis.

One of my key contributions was to the development of the Europe Media Monitor (EMM), a publicly accessible media monitoring platform. EMM analyses over 320,000 online news articles daily in about 70 languages, sourced from approximately 12,000 outlets (figures as of mid-2019).

EMM capabilities include:

Grouping related articles
Categorising content into thousands of subject domains
Extracting and disambiguating entities (persons, organisations, places, products, events)
Identifying and translating direct speech quotations
Tracking news over time
Linking stories across languages.

EMM enables users to explore diverse perspectives by comparing how the same story is reported across countries and languages. It promotes transparency, cultural understanding, and media literacy.

EMM is used by a wide range of institutions, including EU bodies, EU Member State authorities, UN sub-organisations, the African Union, and the Organisation of American States.

▶︎ Read: An introduction to the Europe Media Monitor family of applications

Career Path

Ph.D. in Computational Linguistics / Machine Translation
University of Manchester Institute of Science and Technology (UMIST), UK, 1994
Previous roles:
- Sharp Laboratories of Europe, Oxford (UK)
- Kyushu Institute of Technology, Japan
- Institute for Applied Information Science (IAI), Saarbrücken (Germany)
- African Union, Addis Ababa (Ethiopia).

Scientific Publications and Outreach

I co-authored around 130 international peer-reviewed publications, many of which are accessible online.

I was also invited as a keynote speaker at numerous conferences and workshops. Some of my talks are available online, e.g. on Videolectures.net.

Open Language Resources

A central goal in my work was to create and freely share large-scale multilingual resources to accelerate research in language technology. These include:

Parallel Corpora:
- JRC-Acquis, DGT-Acquis, Digital Corpus of the European Parliament
Translation Memories:
- DGT-TM, ECDC-TM, EAC-TM
Text Categorisation Tool:
- JRC Eurovoc Indexer (JEX)
Name Variant Resource:
- JRC-Names

▶︎ Read: An overview of the European Union’s highly multilingual parallel corpora

Keywords for Search and Discovery

Computational Linguistics, Text Mining, Natural Language Processing, Information Extraction, Named Entity Recognition (persons, organisations, locations, events, and more), Document Clustering, (Multi-label) Categorisation, Summarisation, Terminology Extraction, Quotation Recognition, Opinion Mining (Sentiment Analysis), Multilingual Linguistic Resources.

In 2019, I took early retirement from the JRC to devote more time to photography and other creative pursuits.

Photo by Ricardo Rodrigues da Silva