What is content mining?

- March 27, 2014 in Call to Action, projects

It’s simple really – you can break it down into its two constituent parts:

  1. Content

In this context, content can be text, numerical data, static images such as photographs, videos, audio, metadata or any digital information, and/or a combination of them all. It is a deliberately vague term, encompassing all types of information. Do not confuse content for the medium by which content is delivered; content is independent of medium in the digital sphere.

  2. Mining

In this context, mining refers to the large-scale extraction of information from your target content. If you extract information from just one or two items of content – that’s ‘data extraction’. But if you extract information from thousands of separate items of content – that’s ‘mining’.

Content mining can involve multiple types of content!

It is important to emphasise that the phrase ‘text & data mining’ refers only to the mining of a subset of the types of content one may wish to mine: text & data. Content mining is thus a more useful generic phrase that encompasses all the types of content one may wish to mine.

For my postdoc I’m using content mining to extract phylogenetic information and associated metadata from the academic literature. To be specific, the content I’m mining is text AND images.

The text of an academic paper contains much of the metadata about a phylogenetic analysis I want to extract, whilst the actual result of a phylogenetic analysis is unfortunately mostly only published as an image in papers (completely non-textual) – thus I need to leverage mining techniques for multiple content types.
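
As a rough illustration of what mining multiple content types can look like in practice, here is a minimal, hypothetical pipeline skeleton in Python. The corpus layout, file extensions and handler names are all assumptions made for this sketch – the post doesn’t describe the project’s actual tooling.

```python
from pathlib import Path

# Hypothetical corpus layout (an assumption for this sketch):
# one folder per paper, holding full text as .txt and figures as .png.
CORPUS = Path("papers")

def mine_text(path: Path) -> None:
    """Placeholder: pull analysis metadata (methods, software, taxa)
    out of the article text, e.g. with regular expressions."""

def mine_image(path: Path) -> None:
    """Placeholder: recover the tree topology from a figure,
    e.g. with image-processing and OCR tooling."""

for paper in CORPUS.iterdir():
    if not paper.is_dir():
        continue  # skip stray files at the top level
    for item in paper.iterdir():
        if item.suffix == ".txt":
            mine_text(item)   # textual content: metadata extraction
        elif item.suffix == ".png":
            mine_image(item)  # image content: figure mining
```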

Most mining projects tend to just mine one type of content, with text mining being perhaps the most common. An example question one might attempt to answer using text mining is: How many academic journal articles acknowledge funding support from the Wellcome Trust?
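
As a concrete sketch of that kind of question, the Python below counts matching articles, assuming you have a local folder of plain-text articles you are permitted to mine; the folder name and one-file-per-article layout are assumptions for illustration only.

```python
import re
from pathlib import Path

# Assumption for this sketch: a local folder of plain-text articles
# (one .txt file per article) that you are licensed to mine.
CORPUS_DIR = Path("articles")

# Case-insensitive pattern for the funder's name.
FUNDER = re.compile(r"wellcome\s+trust", re.IGNORECASE)

def acknowledges_funder(text: str) -> bool:
    """Return True if the article text mentions the funder.

    A fuller pipeline would first isolate the acknowledgements or
    funding section; matching the whole text keeps the sketch short.
    """
    return bool(FUNDER.search(text))

total = matches = 0
for path in CORPUS_DIR.glob("*.txt"):
    total += 1
    if acknowledges_funder(path.read_text(errors="ignore")):
        matches += 1

print(f"{matches} of {total} articles mention the Wellcome Trust")
```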

Peter Murray-Rust lists many more possible uses for content mining on his blog: http://blogs.ch.cam.ac.uk/pmr/2014/02/27/101-uses-for-content-mining/

In general, content mining is still an emerging and under-utilized technique in the research sphere – there is much work still to be done and billions of questions still to be answered. It isn’t the one-stop solution to everything, but for appropriate questions it can be the most powerful & comprehensive approach available. Get mining now!

Banishing Impact Factor: OKF signs DORA

- June 23, 2013 in Announcements

The Open Knowledge Foundation has joined nearly 300 organizations in signing the San Francisco Declaration on Research Assessment (DORA). The laudable aim of DORA is to get people to stop using the Journal Impact Factor for research evaluation exercises, as doing so is harmful to science and statistically illiterate. The General Recommendation says:
Do not use journal-based metrics, such as Journal Impact Factors, as a surrogate measure of the quality of individual research articles, to assess an individual scientist’s contributions, or in hiring, promotion, or funding decisions.
There are also helpful, specific recommendations for funding agencies, institutions, publishers, organizations that supply metrics, and researchers. Over 7,000 individuals have signed so far, and among the many other organizations to have signed there are a multitude of notable ones including: American Association for the Advancement of Science (AAAS), Wellcome Trust, Proceedings of the National Academy of Sciences (PNAS), EMBO, Cold Spring Harbor Laboratory Press, Genetics Society of America, Gordon and Betty Moore Foundation, Higher Education Funding Council for England (HEFCE), Howard Hughes Medical Institute, Society of Biology, Ubiquity Press, Austrian Science Fund (FWF), International Association for Plant Taxonomy, Gesellschaft für Biologische Systematik, Museu Nacional (Universidade Federal do Rio de Janeiro), Belgian Royal Society of Zoology and Belgian Journal of Zoology, Universidade de Brasilia, UCSD, Sao Paulo University…

We expect this campaign to be a success, and hope that the commercial journals and societies that have actively chosen not to sign DORA will nevertheless consider its merits.

Panton Fellowships: Apply Now!

- June 12, 2013 in Featured, Open Science, WG Open Data in Science

The Open Knowledge Foundation is delighted to announce the launch of the new Panton Fellowships!

Funded this year by the Computer & Communications Industry Association (CCIA), Panton Fellowships will be awarded to scientists who actively promote open data in science, as per the Panton Principles for Open Data in Science. Visit the Panton Fellowships home page for more information, including details of how to apply.

Further Details

We firmly believe that “open data means better science”. The Panton Fellowships have been created in order to support scientists – particularly graduate students and early-stage career scientists – to explore this idea, and to tackle those barriers which currently prevent science data from being made open. Dr Cameron Neylon, Advocacy Director at PLOS and a member of the Panton Fellowships Advisory Board, commented on the ‘real potential’ of the Fellowships to influence practice surrounding open data in the scientific community:
‘Panton Fellowships will allow those who are still deeply involved in research to think closely about the policy and technical issues surrounding open data.’
By allowing scientists the scope both to explore the ‘big picture’ – gathering evidence to promote discussion throughout the community – and also to work on specific technical solutions to individual problems, the Panton Fellowship scheme has the potential to make a real impact upon the practice of open data in science. Panton Fellows will have the freedom to undertake a range of activities, and prospective applicants are encouraged to formulate their own work plan. As Fellows will continue to be employed and/or study at their current institution, activities undertaken for the Panton Fellowship should ideally complement and enhance their existing work. Fellowships will be held for one year, and will have a value of £8k p.a.

For more details and information on how to apply, please visit http://pantonprinciples.org/panton-fellowships/. Read about the work of our previous Panton Fellows: Sophie Kershaw here (PDF), and Ross Mounce here.

Diverse stakeholders withdraw from Licences for Europe dialogue on text and data mining

- May 28, 2013 in Announcements

The Open Knowledge Foundation, along with several other representatives from the research sector, has withdrawn from the Licences for Europe dialogue on text and data mining due to concerns about the scope, composition and transparency of the process. A letter of withdrawal has been sent to the Commissioners involved in Licences for Europe, explaining why these stakeholders can no longer participate in the dialogue, and their wish to instigate a broader dialogue around creating the conditions to realise the full potential of text and data mining for innovation in Europe. The following organisations have signed the letter:

Licences for Europe was announced in the Communication on Content in the Digital Single Market (18 December 2012) and is a joint initiative led by Commissioners Michel Barnier (Internal Market and Services), Neelie Kroes (Digital Agenda) and Androulla Vassiliou (Education, Culture, Multilingualism and Youth) to “deliver rapid progress in bringing content online through practical industry-led solutions”. Licences for Europe aims to engage stakeholders in four areas:
  1. Cross-border access and the portability of services;
  2. User-generated content and licensing;
  3. Audiovisual sector and cultural heritage;
  4. Text and Data Mining (TDM).
While we are deeply committed to working with the Commission on the removal of legal and other environmental barriers to text and data mining, we believe that any meaningful engagement on the legal framework within which data-driven innovation exists must address the opportunities provided by limitations and exceptions. The current approach of the Commission instead places further licensing as the central pillar of the discussion.

The withdrawal follows much communication with the Commission on the issue, including a letter of concern sent on 26 February 2013 and signed by over 60 organisations. The Commission’s response to this letter is available here. To find out more about the background of Licences for Europe and the issues surrounding text and data mining, please take a look at our background document.

Our Statement on Public Access to Federally-Supported Research Data

- May 16, 2013 in Announcements, External Meetings

Open Access to research publications often takes the limelight in national debates about access to research – but at the Open Knowledge Foundation we know there are also other pressing issues, like the need for Open Data. So we submitted a short written statement to the ongoing US Public Comment Meeting concerning Public Access to Federally Supported R&D Data. Our statement is below:

Each year, the Federal Government spends over $100 billion on research. This investment, in part, is used to gather new data. But all too often the new data gathered isn’t made publicly available and thus can’t generate maximum return on investment through later re-use by other researchers, policy-makers, clinicians and everyday taxpaying citizens. A shining example of the value and legacy of research data is the Human Genome Project. This project and its associated public research data are estimated to have generated $796 billion in economic impact, created 310,000 jobs, and launched a scientific revolution. All from an investment of just $3.8 billion.

With the budget sequestration of 2013 and onwards, it’s vitally important to get maximum value for money on research spending. Ensuring public access to most Federally funded research data will help researchers do more with less. If researchers have greater access to data that’s already been gathered, they can focus more acutely on accumulating just the new data they need, and nothing more. It’s not uncommon for Federally funded researchers to perform duplicate research and gather duplicate data. The competitive and often secretive nature of research means that duplicative research and data hoarding are probably rife, but hard to evidence. Enforcing a public data policy on researchers would thus help to make the overall system more efficient. This tallies with the conclusions of the JISC report (2011) on data centres:
“The most widely-agreed benefit of data centres is research efficiency. Data centres make research quicker, easier and cheaper, and ensure that work is not repeated unnecessarily.”
Another more subtle benefit of making Federally funded data more public is that it would increase the overall importance and profile of US research in the world. Recent research by Piwowar & Vision (2013) robustly demonstrates that research that releases public data gets cited more than research that does not publicly release its underlying data.

The as yet untapped value of research data: I believe most research data has immense untapped re-use value. We’re only just beginning to realise the value of data mining techniques on ‘Big Data’ and small data alike. In the 21st century, now more than ever, we have immensely powerful tools and techniques to make sense of the data deluge. The potential scientific and economic benefits of such text and data mining analyses are consistently rated very highly. The McKinsey Global Institute report on ‘Big Data’ (2011) estimated the value of data mining US health care data alone at $300 billion. I would finish by imploring you to read and implement the recommendations of the ‘Science as an Open Enterprise’ report from the Royal Society (2012):
  • Scientists need to be more open among themselves and with the public and media
  • Greater recognition needs to be given to the value of data gathering, analysis and communication
  • Common standards for sharing information are required to make it widely usable
  • Publishing data in a reusable form to support findings must be mandatory
  • More experts in managing and supporting the use of digital data are required
  • New software tools need to be developed to analyse the growing amount of data being gathered
Ross Mounce, Community Coordinator for Open Science, Open Knowledge Foundation

30 other written statements were also contributed to this session, including one from Creative Commons, and one from Victoria Stodden. These can all be found in the official 64-page PDF here.

Further Reading:

  • Royal Society (2012) Science as an open enterprise. http://royalsociety.org/policy/projects/science-public-enterprise/report/
  • Tripp, S. & Grueber, M. (2011) Economic Impact of the Human Genome Project. Battelle Memorial Institute, Technology Partnership Practice. www.labresultsforlife.org/news/Battelle_Impact_Report.pdf
  • Piwowar, H. & Vision, T. J. (2013) Data reuse and the open data citation advantage. PeerJ PrePrints. https://peerj.com/preprints/1/
  • JISC (2011) Data centres: their use, value and impact. http://www.jisc.ac.uk/publications/generalpublications/2011/09/datacentres.aspx
  • Manyika, J. et al. (2011) Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute. http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation

The Open Science Training Initiative

- May 13, 2013 in Announcements

Posted on behalf of Sophie Kershaw, one of our Panton Fellows 2012/13, recapping her work training the next generation in the art of open science. Over to you Sophie:

Some of you may have been following the progress of my Panton Fellowship work over the past year, the main focus of which was establishing a graduate training scheme in open science, the Open Science Training Initiative (OSTI). There have been some exciting developments with the course in recent weeks, and we’re really close to releasing the first full set of course materials for others to use in their own teaching to train young academics in open science and data-centric/digital research methodologies, so I thought I’d update you all on progress. If you’re interested in hearing about how the course works in practice, then scroll down for a download link to the post-pilot report!

What is OSTI?

The OSTI scheme is a teaching pattern and series of mini-lectures, designed to transform existing subject-specific graduate courses in the sciences to foster open working practices and scientific reproducibility. Its main features include:
  • a dynamic teaching model of Rotation Based Learning;
  • hands-on application of licensing, data management and data analysis techniques, building students’ knowledge of, and confidence in using, these approaches;
  • daily lectures and exercises in key subjects including “Content, Code & Data Licensing”, “The Changing Face of Publication” and “Data Management Planning”, which accompany the main component of research time in the timetable, providing students with knowledge they can then consolidate through application to their research.
Open Science Training in Practice – download the report now!

After many months of hard work and analysis, the post-pilot report on OSTI was released last Saturday and is now available for download from the OSTI website, via http://www.opensciencetraining.com/content.php. The report draws on a broad range of perspectives from the student cohort, the auxiliary demonstrators and the course leader. A curated data set to accompany the report will be appearing on the OSTI site very soon, and lecture movies from the pilot initiative have been appearing on the site over the past week. Keep checking back over the coming weeks as more content and downloads become available.

Where can I get course materials?

The official set of course materials will be appearing on our GitHub repository over the coming weeks – these are currently being tweaked based on the feedback we received, and I can’t wait for others to fork the project and create other versions of the course as well. Please feel free to get in touch with me if you’d like to hear more about OSTI, or have any comments, questions or suggestions.

If we’re going to encourage uptake of open working practices in the sciences, we need to start training our researchers in these approaches now. If you think there’s an opening at your institution for this kind of approach, then I would love to hear from you! You can tweet Sophie at @StilettoFiend or email her at: sophie dot kershaw at okfn.org

The White House Seeks Champions of Open Science

- May 8, 2013 in Open Access, Open Science, WG Open Data in Science

Here at the Open Knowledge Foundation, we know Open Science is tough, but ultimately rewarding. It requires courage & leadership to take the open path in science. Nearly a week ago on the open-science mailing list we started putting together a list of established scientists who have in some way or another made significant contributions to open science or lent their esteemed reputation to calls for increased openness in science. Our open list now has over 130 notable scientists, among whom 88 are Nobel prize winners.

In an interesting parallel development, the White House has just put out a call to help identify “Open Science” Champions of Change – outstanding individuals, organizations, or research projects promoting and using open scientific data for the benefit of society. Anyone can nominate an Open Science candidate for consideration by May 14, 2013.

What more proof do we need that open science is both good, and valued in society? This marks a tremendous validation of the open science movement. The US government is not seeking to reward just any scientist; only open scientists actively working to change the world for the better will win this recognition.

We’re still a long way from Open Science being the norm in science. But perhaps now, we’re a crucial step closer to important widespread recognition that Open Science is good, and could be the norm in the future. We eagerly await the unveiling of the winning Open Science champions at the White House on the 20th June later this year.

Science Europe denounces ‘hybrid’ Open Access

- May 2, 2013 in Open Access, Open Science, WG Open Data in Science

Recently Science Europe published a clear and concise position statement titled Principles on the Transition to Open Access to Research Publications. This is an extremely timely & important document that clarifies what governments and research funders should expect during the transition to open access. Unlike the recent US OSTP public access policy, which allows publishers to apply up to a 12 month access embargo (to the disgust of some scientists like Michael Eisen) on publicly-funded research, this new Science Europe statement makes clear that an embargo of at most 6 months should be accepted for publicly funded STEM research. The recent RCUK (UK research councils) open access policy also requires a 6 month embargo at most, with some caveats. But among the many excellent principles is a particularly bold and welcome proclamation:
the hybrid model, as currently defined and implemented by publishers, is not a working and viable pathway to Open Access. Any model for transition to Open Access supported by Science Europe Member Organisations must prevent ‘double dipping’ and increase cost transparency
Hybrid options are typically far more expensive than ‘pure’ open access journal costs, and they don’t typically aid transparency or the wider transition to open access. The Open Knowledge Foundation heartily endorses these principles, as together with the above they respect and reinforce the need for free access AND full re-use rights to scientific research.

About Science Europe: Science Europe is an association of European Research Funding Organisations and Research Performing Organisations, based in Brussels. At present Science Europe comprises 51 Research Funding and Research Performing Organisations from 26 countries, representing around €30 billion per annum.

Weekly Citizen Science Hangouts

- April 3, 2013 in Announcements, Meetings

Capitalizing on the success of our recent CrowdCrafting hack day, from this Thursday onwards we’ll be holding a weekly public Google+ Hangout to discuss citizen science and related topics. Details of the first meeting are below:

Thursday 4th April, 5pm (BST) – Weekly Citizen Science Hangout on Google+ here

In the first meeting we shall talk with special guest Michal Kubacki about his Misomorf application that may eventually be developed into a citizen app to help scientists with the graph isomorphism problem. This problem was proposed at the recent hack day by mathematician and quantum computing expert Simone Severini of UCL, who will also join the Hangout. Your participation at these weekly meetings is both welcome and encouraged. If you can’t make the first one, then perhaps the next?
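
For readers unfamiliar with it: two graphs are isomorphic if one can be turned into the other just by relabelling its nodes. The sketch below illustrates the problem itself using the networkx Python library – it is not the Misomorf application, merely a toy demonstration.

```python
import networkx as nx

# Two graphs with different node labels but the same structure:
# a triangle with one pendant vertex attached.
G1 = nx.Graph([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])
G2 = nx.Graph([(1, 2), (2, 3), (3, 1), (1, 4)])

# networkx decides isomorphism with the VF2 matching algorithm.
# No polynomial-time algorithm is known for the general problem,
# which is part of what makes crowd-assisted approaches interesting.
print(nx.is_isomorphic(G1, G2))  # True, e.g. c->1, a->2, b->3, d->4
```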

Further information from the recent Science Hack Day

Among the many hack day projects I didn’t get to write about in the last blog post was the Yellowhammers project. This citizen science project uses sound recordings of yellowhammer (bird) dialects and is a joint activity of the Department of Ecology, Charles University in Prague, and the Czech Society for Ornithology.

Volunteers have helped both sample bird song and classify these hundreds of hours of audio recordings into different dialects. Whilst the initial project sampled birds just in the Czech Republic, the team are now expanding to try and capture bird song recordings from the United Kingdom and New Zealand, where this bird species also lives.

Here at the Open Knowledge Foundation, we hope that the data collection and analysis could be aided by both the use of our Crowdcrafting.org platform and the EpiCollect mobile software. Work is presently under way to make this a reality.

So whether you’re a software dev, amateur scientist, tinkerer or twitcher – perhaps we’ll see you tomorrow at the Citizen Science Hangout?