Citizen Scientist’s Algorithm Helps Science Gossip Team to Reduce Text-only Pages

Thanks to the efforts of an active Zooniverse volunteer and the Science Gossip project team, users can now focus on the beautiful illustrations found in Science Gossip’s 19th-century natural history periodicals and spend less time marking pages with no illustrations. The Biodiversity Heritage Library (BHL) already had algorithms for filtering out pages without images from a related earlier project, Art of Life, for which project partners at the Indianapolis Museum of Art Lab developed four algorithms. The two deemed most useful were based on 1) coordinate metadata from ABBYY software and 2) contrast properties of the pages. The Science Gossip project team had considered using these algorithms to filter pages before uploading them to the Zooniverse site, but decided against it on the assumption that users might like to view all pages in a journal for context. After the launch in March 2015, it became clear that many Science Gossip users wanted the team to reduce the number of pages without illustrations: they did not want to spend their time on such pages when the project was really about illustrations.
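For readers curious how a filter based on coordinate metadata might work, here is a minimal sketch in Python. It is not the BHL or Indianapolis Museum of Art Lab code. It assumes that OCR output has already been parsed into a list of blocks, each with a type label and pixel coordinates; the `Block` class, the "Picture" label, and the `min_fraction` threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One layout region reported by OCR software (hypothetical structure)."""
    kind: str    # assumed labels such as "Picture" or "Text"
    left: int
    top: int
    right: int
    bottom: int


def has_illustration(blocks, page_width, page_height, min_fraction=0.02):
    """Return True if picture blocks cover at least min_fraction of the page.

    The 2% default is an arbitrary illustrative threshold, not a value
    taken from the Science Gossip project.
    """
    page_area = page_width * page_height
    if page_area <= 0:
        return False
    picture_area = sum(
        max(0, b.right - b.left) * max(0, b.bottom - b.top)
        for b in blocks
        if b.kind == "Picture"
    )
    return picture_area / page_area >= min_fraction


# Two made-up pages: one with text only, one with a text block and a picture.
text_only = [Block("Text", 100, 100, 900, 1400)]
illustrated = text_only + [Block("Picture", 200, 300, 800, 900)]

print(has_illustration(text_only, 1000, 1500))
print(has_illustration(illustrated, 1000, 1500))
```

A filter like this only reads metadata that the OCR step has already produced, so it can run without opening the page images themselves.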

An active volunteer, Briana Harder (aka Quia on Zooniverse), prodded the Science Gossip team to consider using automated methods for filtering and even put together an algorithm herself for the team to test against its existing algorithms. Briana’s algorithm picks out chunks of images, although if the background is too variable it sometimes picks out text as well. When the three algorithms were compared for accuracy, Briana’s algorithm and the BHL ABBYY algorithm both performed well, with less than a 1% margin of error. The contrast algorithm performed poorly, and Briana and the team judged that it was not useful for filtering. In the end the team decided to use only the ABBYY algorithm, since the pages had already been processed by it and it would be much quicker to implement.
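A comparison of this kind can be sketched as follows, assuming each filter's page-by-page verdicts are checked against hand-labelled pages. The function name `error_rate` and the labels below are invented for illustration; they are not the project's data or results.

```python
def error_rate(predictions, truth):
    """Fraction of pages where a filter's verdict disagrees with the hand label."""
    if len(predictions) != len(truth):
        raise ValueError("predictions and truth must have the same length")
    if not truth:
        raise ValueError("at least one labelled page is required")
    wrong = sum(p != t for p, t in zip(predictions, truth))
    return wrong / len(truth)


# Invented labels: True means "page has an illustration".
hand_labels = [True, False, False, True, False]
filter_a = [True, False, False, True, False]   # hypothetical filter output
filter_b = [True, True, False, False, False]   # hypothetical filter output

print(error_rate(filter_a, hand_labels))
print(error_rate(filter_b, hand_labels))
```

Counting false positives and false negatives separately would show whether a filter tends to discard illustrated pages or to keep text-only ones, which matters more for a project built around illustrations.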

The filter was applied in mid-May. Since then the number of pages without illustrations has been reduced considerably, which we hope has made for a more satisfying experience for our users. Thanks go out to Briana and the folks on the team who worked on this task. When asked what her motivations were for contributing to Science Gossip and other Zooniverse projects, Briana explained:

My involvement with Science Gossip is an adventure in serendipity. Darren McRoy, Zooniverse’s community builder, gave me a nudge to go check out some of the newer Zoo projects, among others I ended up on Floating Forests, and in one of their blog posts, they asked for help in improving their pipeline for selecting coastline images for classifications. […] I wrote an algorithm that improved the pipeline […]

Briana’s work on Floating Forests led that team to reprocess their data and dramatically speed up the project. While this reprocessing was underway, Science Gossip launched a beta test. Briana noticed that there were a lot of text-only pages in this project—another opportunity for an algorithm!

I thought ‘There has to be a good way to filter out all these pages, text recognition is a well developed field…I bet I could write something to filter these so the project can be more efficient.’

And she was absolutely right. Thank you, Briana, and all Zooniverse users who go above and beyond to help us improve our projects! The collaborative spirit of this community continues to benefit all of us in ways we never expected. We hope everyone is benefiting from the reduced noise in the Science Gossip dataset, and we would love feedback from our users on the impact of the filtered pages.

Trish Rose-Sandler, Data Analyst, BHL and Data Projects Coordinator, Missouri Botanical Garden
