Instituut voor Beeld en Geluid
Beeld & Geluid is not only the museum of media and television; it is also responsible for archiving the audio-visual content of all Dutch radio and television broadcasters. Around 800,000 hours of material are available in the Beeld & Geluid archives.
The full archive of Beeld & Geluid is available to the general public. Users can search for and purchase full or partial broadcasts of their favourite shows. Broadcasters have their own dedicated website for searching the same archive, which they use to find subject-related footage. They can purchase the rights to the footage and include it in, for example, the evening news.
Goal of the project
Trifork replaced the existing search solution of the content archive. Although that solution was still reasonably fast and usable, Beeld & Geluid decided to replace it with a more modern full text search engine, Elasticsearch. Two factors drove the decision. First, the existing engine is proprietary and has reached end-of-life, so support is no longer offered. Second, Beeld & Geluid would like to open its searchable archive to other input sources and to other search interfaces, which will be easier to accomplish with a system designed to be open from the start.
The main goal for the first phase of the project was to replace the current full text search system while keeping the change invisible to both the input and the output interfaces.
Short overview of the online catalogue searching solution
In the Beeld & Geluid online catalogue, Trifork treated a document as all the information regarding a particular broadcast, such as title, description, summary and date of broadcast. Currently two types of documents are searchable in the catalogue: one containing the full broadcast and another containing just a selection of it.
Internally, documents are stored as XML (the “Storage” area of the drawing above). The XML is also parsed, and the relevant data is sent to the full text search engine for indexing.
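The parse-then-index step can be illustrated with a minimal Python sketch. The element names (`title`, `description`, `summary`, `broadcastDate`) and the function `extract_index_fields` are hypothetical; the real catalogue schema is not shown in this article, and the production system uses the Elasticsearch Java API rather than Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; the real catalogue schema is not shown here.
INDEXED_FIELDS = ("title", "description", "summary", "broadcastDate")


def extract_index_fields(doc_id, xml_text):
    """Parse a broadcast XML document and return the fields to index."""
    root = ET.fromstring(xml_text)
    fields = {"id": doc_id}
    for name in INDEXED_FIELDS:
        element = root.find(name)
        # Skip missing or empty elements instead of indexing blanks.
        if element is not None and element.text and element.text.strip():
            fields[name] = element.text.strip()
    return fields


sample = """<broadcast>
  <title>Evening News</title>
  <description>Daily news bulletin</description>
  <broadcastDate>2012-05-01</broadcastDate>
</broadcast>"""

print(extract_index_fields("doc-1", sample))
```

The full XML stays in storage untouched; only the extracted fields would be handed to the search engine.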
The system receives requests to index a new broadcast as SOAP requests. The searching interfaces (the search websites) also communicate with the system through SOAP requests. There are two major types of requests from these interfaces: search requests, which are handled internally by the full text search engine, and view document details requests, which are handled internally by the storage. A view document details request is made when the user clicks to view the details of a document. This happens after the user has performed a search, so the id of the document is already known and the document can simply be retrieved from the storage system by id.
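The split between the two request types can be sketched as follows. This is an illustrative in-memory stand-in, not the actual implementation: the class `Catalogue`, its method names, and the naive term matching are all assumptions, and the SOAP layer is left out entirely.

```python
class Catalogue:
    """In-memory stand-in for the search engine and the document storage."""

    def __init__(self):
        self._storage = {}  # document id -> stored XML
        self._index = {}    # lower-cased term -> set of document ids

    def index_document(self, doc_id, xml_text, searchable_text):
        """Indexing request: keep the XML and index the searchable text."""
        self._storage[doc_id] = xml_text
        for term in searchable_text.lower().split():
            self._index.setdefault(term, set()).add(doc_id)

    def search(self, query):
        """Search request: answered by the full text index, returns ids."""
        terms = query.lower().split()
        if not terms:
            return []
        # A document matches only if it contains every query term.
        matches = set.intersection(*(self._index.get(t, set()) for t in terms))
        return sorted(matches)

    def view_details(self, doc_id):
        """View-details request: a plain lookup by id in storage."""
        return self._storage[doc_id]


catalogue = Catalogue()
catalogue.index_document(
    "doc-1", "<broadcast><title>Evening News</title></broadcast>", "Evening News"
)
catalogue.index_document(
    "doc-2", "<broadcast><title>Morning News</title></broadcast>", "Morning News"
)

hits = catalogue.search("evening news")
print(hits)
print(catalogue.view_details(hits[0]))
```

The search returns only ids, so the details view never needs the index: it is a key lookup, which is why the storage side can be kept simple.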
Replacing the full text search engine
Trifork chose Elasticsearch to replace the full text search engine. Elasticsearch was chosen over Solr for what Trifork feels is an easier-to-use Java API and for the clustering and fail-over support that Elasticsearch provides out of the box.
Replacing the storage backend
A NoSQL database is an ideal candidate for replacing the current storage solution. There is very little dependency between the documents, each document must be stored and returned as XML, and most of the requests are done by id. Trifork considered MongoDB, but in the end decided to use Elasticsearch as storage as well. This keeps the application stack small, which in turn shortens the learning curve for new developers.
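Because Elasticsearch stores JSON documents, one way to meet the "stored and returned as XML" requirement is to carry the original XML verbatim inside the JSON body, next to the searchable fields. The sketch below only illustrates that idea; the field name `xml` and both helper functions are hypothetical, and the article does not describe the document layout Trifork actually used.

```python
import json


def to_storage_body(xml_text, index_fields):
    """Build a JSON body holding the searchable fields plus the raw XML."""
    body = dict(index_fields)
    body["xml"] = xml_text  # hypothetical field name for the verbatim XML
    return json.dumps(body)


def xml_from_storage_body(body):
    """Recover the original XML from a stored JSON body."""
    return json.loads(body)["xml"]


original = '<broadcast><title>Journaal "20:00"</title></broadcast>'
stored = to_storage_body(original, {"title": 'Journaal "20:00"'})

# The XML comes back unchanged, including characters JSON has to escape.
print(xml_from_storage_body(stored) == original)
```

A lookup by id then returns the body in one request, and the XML can be passed on to the caller without being rebuilt from the indexed fields.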
Trifork successfully implemented both SOAP web services, delivered a fully functioning Elasticsearch storage backend and replaced the full text search engine with Elasticsearch. In addition to sharing technical expertise, Trifork also introduced Beeld & Geluid to the Scrum development framework. Trifork helped with the planning and estimation meetings, facilitated the retrospectives, and answered the questions Beeld & Geluid had about the process. The work described here was accomplished in two sprints of three weeks each.