What is the problem that the OCR platform tries to solve?
The Hellenic Parliament stores parliamentary questions as a combination of metadata extracted manually from the original text and the scanned document as an image file. Consequently, the parliamentary questions cannot be widely accessed or studied, because the original content is available only as an image rather than as text. A 10-step OCR process was designed to fully reconstruct the original content of the parliamentary questions. The results of the OCR process are combined with the metadata to reproduce the full text of the original document.
Why the choice to crowdsource?
Resolving the data deficiencies during several project stages proved very time-consuming, which led us to trial a crowdsourcing approach to speed up the work. A team of scientists from a wide spectrum of disciplines is dedicated to making the project a success. It includes academics, officials and students from several institutions, such as the Hellenic Parliament, the University of Athens and the University of Cyprus, as well as contracted or self-employed software developers.
How does the OCR platform work?
All 27 members across seven countries are virtually linked through an online exchange platform, while Athens-based members gather for monthly meetings where problems are discussed and best practices exchanged. Newcomers receive basic training upon entering the group, while more experienced members, called “mentors”, provide peer-to-peer advice and support. Team members process the parliamentary texts assigned to them at their convenience and at their own pace. The finished text units, called “packages”, then pass through a quality check by the mentors and are pipelined for scientific examination.
What’s in it for the “crowd”?
The initiative creates a win-win situation by enabling members to gain early access to scientific projects and acquire valuable new skills as well as hands-on work experience in exchange for their time and support.
What are the ultimate benefits?
At the end of the process, the digital content is made available in an open and structured format, such as XML (eXtensible Markup Language). Novel tools and methods from the field of computational linguistics can then be applied. The availability of unified, verified corpora allows several formerly disparate areas of research, such as history, political science and linguistics, to be interlinked, thus opening up new horizons in the understanding of parliamentary information and discourse.
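As a rough illustration of what such a structured output might look like, the following minimal sketch combines manually extracted metadata with OCR-derived text into a single XML document. The element names (`question`, `metadata`, `field`, `text`), the function name and the sample values are assumptions made for this example only; they are not the schema or the software actually used by the platform.

```python
import xml.etree.ElementTree as ET


def build_question_xml(metadata: dict, ocr_text: str) -> str:
    """Combine metadata and OCR text into one XML string.

    The element names used here are hypothetical, chosen only
    to illustrate the idea of a unified, structured record.
    """
    root = ET.Element("question")

    # Store each metadata entry as <field name="...">value</field>,
    # so that arbitrary keys never have to be valid XML tag names.
    meta = ET.SubElement(root, "metadata")
    for key, value in metadata.items():
        field = ET.SubElement(meta, "field", name=str(key))
        field.text = str(value)

    # The reconstructed full text of the scanned document.
    body = ET.SubElement(root, "text")
    body.text = ocr_text

    return ET.tostring(root, encoding="unicode")


# Hypothetical sample values, not real parliamentary data.
sample = build_question_xml(
    {"number": "1234", "date": "2015-03-02"},
    "Sample question text.",
)
print(sample)
```

Once the records share one structure of this kind, standard XML tooling can query them by metadata field or extract the text for linguistic analysis.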
What does the future hold for the OCR Team?
Now close to successfully processing a decade of parliamentary oversight data, the platform has redirected its efforts toward leading digital transformation in the Hellenic Parliament and beyond, through the fields of legal informatics and linked open data. Furthermore, as the platform creates open source software for conducting and structuring data collection and for computational linguistics analyses, the Hellenic OCR Team now plans to merge its work into a suite of apps and services. The ultimate goal is to serve as an integration and migration platform that can be configured for and extended to the needs of any parliament or organization.
Dr. Fotis Fitsilis, Scientific Service, Hellenic Parliament
Tel.: +30 210 3673395