
What is content mining?

- March 27, 2014 in content mining, definition, intro, research

It’s simple really – you can break it down into its two constituent parts:

  1. Content

In this context, content can be text, numerical data, static images such as photographs, videos, audio, metadata or any other digital information, and/or a combination of them all. It is a deliberately vague term, encompassing all types of information. Do not confuse content with the medium by which it is delivered; content is independent of medium in the digital sphere.

  2. Mining

In this context, mining refers to the large-scale extraction of information from your target content. If you extract information from just one or two items of content – that’s ‘data extraction’. But if you extract information from thousands of separate items of content – that’s ‘mining’.

Content mining can involve multiple types of content!

It is important to emphasise that the phrase ‘text & data mining’ refers only to the mining of a subset of the types of content one may wish to mine: text & data. Content mining is thus a more useful generic phrase that encompasses all the types of content one may wish to mine.

For my postdoc I’m using content mining to extract phylogenetic information and associated metadata from the academic literature. To be specific, the content I’m mining is text AND images.

The text of an academic paper contains much of the metadata about a phylogenetic analysis that I want to extract, whilst the actual result of the analysis is, unfortunately, mostly published only as an image (completely non-textual) – thus I need to use mining techniques for multiple content types.

Most mining projects tend to just mine one type of content, with text mining being perhaps the most common. An example question one might attempt to answer using text mining is: How many academic journal articles acknowledge funding support from the Wellcome Trust?
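
As a rough illustration of that kind of question, the sketch below finds the articles in a tiny in-memory collection that mention the Wellcome Trust. The sample texts, the `articles` dictionary and the function name are all invented for illustration; a real project would read thousands of full texts from files or an API, and would probably need to handle name variants and restrict the search to the acknowledgements or funding section.

```python
import re

# Hypothetical sample data: article identifier -> plain text of the article.
articles = {
    "paper-001": "We thank our colleagues. This work was funded by the Wellcome Trust.",
    "paper-002": "No external funding was received for this study.",
    "paper-003": "Acknowledgements: supported by the Wellcome  Trust and others.",
}

# Case-insensitive pattern that tolerates variable whitespace between the words.
FUNDER_PATTERN = re.compile(r"\bwellcome\s+trust\b", re.IGNORECASE)


def find_mentions(texts, pattern=FUNDER_PATTERN):
    """Return the sorted ids of all texts in which the pattern occurs at least once."""
    return sorted(doc_id for doc_id, text in texts.items() if pattern.search(text))


matching = find_mentions(articles)
print(f"{len(matching)} of {len(articles)} articles mention the funder: {matching}")
```

A plain pattern match like this is only a starting point: it counts any mention of the funder, not just acknowledgements of funding support.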

Peter Murray-Rust has many more different possible uses for content mining on his blog:

In general, content mining is still an emerging and under-utilized technique in the research sphere – there is much work still to be done and billions of questions still to be answered. It isn’t the one-stop solution to everything, but for appropriate questions it can be the most powerful & comprehensive approach available. Get mining now!

Some suggested tutorials & resources you might want to start with:

The Open Linguistics Working Group

- May 20, 2011 in intro, linguistics, open linguistics, WG Linguistics

Status Quo and Perspectives, by Christian Chiarcos and Sebastian Hellmann

Since its formation last year, the Open Linguistics Working Group (OWLG) has been growing steadily, and the direction in which the working group is heading has become clearer (although a number of issues remain open). In recent months, we concentrated on identifying goals and directions for the working group to pursue. In this blog post, we summarize the results of this process, the group’s current status, and the main challenges and problems we have identified so far. An important result of our discussion is the set of seven points described in the next section, which define the purpose of the working group. In the section after that, we summarize four major problems and challenges of working with linguistic data; these will become a primary topic of the Working Group. Thereafter, we give an overview of the current status and activities of the group and provide some suggestions for how to get involved.


As a result of numerous discussions with interested linguists, NLP engineers and information technology experts, we identified seven open problems for our respective communities and the ways in which they use, access and share linguistic data. These represent the challenges to be addressed by the working group, and the role that it is going to fulfil:
  1. Promote the idea and definition of open data in linguistics and in relation to language data.
  2. Act as a central point of reference and support for people interested in open linguistic data.
  3. Provide guidance on legal issues surrounding linguistic data to the community.
  4. Build an index of indexes of open linguistic data sources and tools and link existing resources.
  5. Facilitate communication between existing groups.
  6. Serve as a mediator between providers and users of technical infrastructure.
  7. Assemble best-practice guidelines and use cases to create, use and distribute data.
In many respects, the OWLG is not unique with regard to these goals. Indeed, there are numerous initiatives with similar motivation, e.g., the Cyberling blog, the ACL Special Interest Group for Annotation, and large multi-national initiatives such as the ISO initiative on Language Resources Management (ISO TC37/SC4) or European projects such as CLARIN, FLARENET and META-NET. The key difference between these and our Working Group is that we are not affiliated with an existing organization or one particular community; rather, our members represent the whole spectrum from academic linguistics (with its various subfields, e.g., typology and corpus linguistics) through applied linguistics (e.g., language documentation, computational linguistics, computational lexicography) and computational philology to natural language processing and information technology. We do not consider ourselves to be in competition with any existing organization, but hope to establish new links and further synergies between them. In the following section, we summarize typical and concrete scenarios where such an interdisciplinary community may help to resolve problems observed (or, sometimes, overlooked) in the daily practice of working with linguistic resources.

Open linguistics resources, problems and challenges

Among the broad range of problems associated with linguistic resources, we identified during our discussions four major classes of problems and challenges that may be addressed by the OWLG. First, there is great uncertainty with respect to legal questions concerning the creation and distribution of linguistic data; second, there are technical problems such as the choice of tools, representation formats and metadata standards for different types of linguistic annotation; third, we have not yet identified a point of reference for existing open linguistic resources; finally, there is the agitation challenge, i.e., how (and whether) we should convince our collaborators to release their data under open licenses. These challenges are described in detail below.

1. Legal questions
The linguistic community is becoming increasingly aware of the potentially difficult legal status of different types of linguistic resources:
  • How to find a suitable license for my corpus?
  • Whose copyright do I have to respect? For example, corpora may have complex copyright situations where the original authors own the primary data, and thus may have partial copyright on the entire collection.
  • Are there exceptions to copyright (e.g., for academic research) that may allow me to work with my corpus anyway?
  • How to circumvent (or solve) copyright issues?
  • What legal restrictions apply to a particular resource (e.g., web corpora, newspaper corpora, digitizations of printed editions, audio and video files)?
  • How to create multi-media (audio, video) data collections in a way that allows us to use (and hopefully, distribute) them for research?
The situation is even more complex because the legal situation may change over time (e.g., German copyright law was changed twice within the last decade), and this complexity multiplies on an international scale. The OWLG provides a platform to discuss such problems, and to collect recommendations and document use cases as found in publications and technical reports and as discussed at conferences and on mailing lists.

2. Technical problems
Often, when creating a new corpus in a novel domain, the question arises which tool to choose for which type of annotation. The OWLG will collect case studies and best practice recommendations on this question: it will encourage the documentation of use cases, collect links to documented case studies and best practice recommendations (e.g., by EMELD, or FLARENET), and participate in the maintenance of existing sites that provide an overview of annotation tools and their domains of application (e.g., the Linguistic Annotation Wiki, or corresponding parts of the ACL Wiki).
A question related to the choice of tools is which representation formalisms to choose. We intend to provide basic information about proposed standard formats (e.g., the ISO proposal LAF/GrAF, the specifications of the Text Encoding Initiative [TEI]) and applicable formalisms (e.g., XML or RDF). These formats, again, are closely related to the question of which corpus infrastructure (database, search interface) is suitable to store, query and visualize which kinds of linguistic annotations (e.g., domain- and community-specific tools like Toolbox and ELAN, or general-purpose corpus query tools like ANNIS).
A third problem concerns the documentation requirements for different types of resources, the use of metadata standards (e.g., Dublin Core, or the TEI header), and how annotation documentation and interoperability can be improved by linking linguistic resources with terminology repositories (e.g., GOLD, ISOcat). The OWLG aims to collect such questions and (partial) answers to them; we will contribute to existing metadata repositories and co-operate with other initiatives that pursue similar goals, e.g., the ACL Special Interest Group in Linguistic Annotation. Unlike these, the OWLG does not require membership in a particular organization, and we focus on linguistic resources released under an open license.
Further, we encourage (but do not require) the conversion of linguistic resources to Linked Data.
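To make the combination of a metadata standard and RDF a little more concrete, here is a minimal sketch that writes a Dublin Core description of a corpus as N-Triples, one of the plain-text RDF serialisations. It is not a format recommended by the working group; the corpus URI, the metadata values and the helper names are placeholders invented for illustration, and only the Dublin Core terms namespace and the `title`, `description` and `license` terms are taken from the standard. A dedicated RDF library would be the better choice for real data.

```python
# Namespace of the Dublin Core metadata terms.
DCTERMS = "http://purl.org/dc/terms/"


def escape_literal(value):
    """Escape backslashes, quotes and line breaks for an N-Triples string literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def to_ntriples(resource_uri, metadata):
    """Serialise a dict of Dublin Core term -> value as N-Triples lines.

    Values starting with http:// or https:// are written as URI references,
    everything else as plain string literals.
    """
    lines = []
    for term, value in sorted(metadata.items()):
        if value.startswith(("http://", "https://")):
            obj = f"<{value}>"
        else:
            obj = f'"{escape_literal(value)}"'
        lines.append(f"<{resource_uri}> <{DCTERMS}{term}> {obj} .")
    return "\n".join(lines)


# Hypothetical corpus description; the URI and all values are placeholders.
corpus_uri = "http://example.org/corpus/demo"
corpus_metadata = {
    "title": "Demo corpus",
    "description": "A small \"toy\" corpus used as an example.",
    "license": "http://creativecommons.org/licenses/by/3.0/",
}

print(to_ntriples(corpus_uri, corpus_metadata))
```

Because the license is given as a URI rather than as free text, other datasets and tools can link to it unambiguously, which is the basic idea behind Linked Data.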

3. Overview of existing resources
If a new research question is to be addressed, the question arises which resources may already be available and whether they are accessible. Often, this problem is still solved by asking experts on mailing lists, e.g., the CORPORA list. Therefore, the OWLG has begun to collect metadata about open linguistic resources within the CKAN repository. Although other metadata repositories are available (e.g., those maintained by META-NET, FLARENET, or CLARIN), the CKAN repository is qualitatively different in two respects. On the one hand, CKAN focuses on the license status of the resources and encourages the use of open licenses. On the other hand, it is not specifically aimed at linguistic resources; rather, it is used by a large set of different working groups, whose resources may be exploited by linguists (e.g., exhaustive collections of legal documents from several countries [from law], or the open richly annotated cuneiform corpus [from archeology]).
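
For readers who want to look up such metadata programmatically, the following sketch shows how a search for datasets tagged “linguistics” might be expressed against a CKAN site. It assumes the `package_search` call of the CKAN Action API (version 3) found in current CKAN releases, which may differ from the interface available when this post was written. The site address `https://ckan.example.org` is a placeholder, the function names are invented, and the response is a hand-written sample rather than the result of a live request.

```python
import json
from urllib.parse import urlencode


def build_search_url(base_url, tag, rows=10):
    """Build a package_search request URL that filters datasets by tag."""
    query = urlencode({"fq": f"tags:{tag}", "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{query}"


def summarise(response_text):
    """Extract (name, license_id) pairs from a package_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    return [(pkg["name"], pkg.get("license_id"))
            for pkg in payload["result"]["results"]]


# Placeholder site; substitute the CKAN instance you actually use.
url = build_search_url("https://ckan.example.org", "linguistics", rows=5)
print(url)

# Hand-written sample in the shape the Action API returns (not real data).
sample_response = json.dumps({
    "success": True,
    "result": {
        "count": 1,
        "results": [{"name": "demo-corpus", "license_id": "cc-by"}],
    },
})
print(summarise(sample_response))
```

Listing the license identifier next to each dataset name mirrors the point made above: the license status is the piece of metadata CKAN puts in the foreground.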

4. Agitation
One of the goals of the OWLG is the promotion of open licenses for linguistic data collections. As we know from practical experience, researchers sometimes hesitate to provide their data under an open license. There are many different reasons for this, ranging from uncertainty about the legal situation to the (understandable) fear that other people will exploit the resources before the original author has had the chance to do so. We hope to contribute to the clarification of legal issues and to provide case studies that may help to address these concerns. For example, one solution to the second concern mentioned above may be that data collections are designed as open linguistic resources from the beginning, but that their publication is delayed for several years, so that the creators can exploit the data for long enough before any competitor gets hold of it. One important argument in favor of open resources in academia is that empirically working linguists can meet elementary scientific standards such as verifiability only if their resources are available to other researchers. Following this premise, we intend to promote the use of open resources in linguistics.

Current status and on-going developments (as of May 19th, 2011)

So far, we have focused on delineating the questions the Open Linguistics Working Group may address and on formulating its general goals and potentially fruitful application scenarios. This blog entry summarizes these discussions, and it concludes a critical step in the formation of the working group: having defined a (preliminary) set of goals and principles, we can now concentrate on the tasks at hand, that is, collecting resources and attracting interested people in order to address the challenges identified above.
At the moment, our Working Group comprises 32 people from 21 different organizations and 7 countries (Germany, US, UK, France, Canada, Hungary, and Slovenia). Our group is relatively small, but continuously growing and sufficiently heterogeneous. It includes people from library science, typology, historical linguistics, cognitive science, computational linguistics, and information technology, to name just a few, so the ground for fruitful interdisciplinary discussions has been laid. We are very glad that famous linguists such as Nancy Ide (Text Encoding Initiative, American National Corpus, Vassar College) and Christiane Fellbaum (WordNet, Princeton University) accepted our invitation to post guest blogs. We would like to strengthen this tradition and encourage all members of the OWLG to describe interesting projects and experiences on this medium, to share insights and difficulties over the Open Linguistics mailing list, and, of course, to join our meetings and telcos. The next meeting will be held in conjunction with the Fifth Open Knowledge Conference (OKCon), June 30th to July 1st 2011 in Berlin, Germany, and of course the OKCon itself is a great reason to join us there.
As for our first concrete activities, we have begun to compile a list of resources of particular interest to the members of the working group. Most of these resources are free, others are partially free (i.e., annotations free, but text under copyright), and a few have been included because they are very representative of a particular type of resource (e.g., corpora derived from the Penn Treebank as a prototypical multi-layer corpus). Altogether, the list now comprises 102 entries, and the next step would be to register them at the CKAN metadata repository and to select a few for deeper investigation. One aspect of such investigations may be the conversion of some of the resources to RDF and their provision as Linked Data. Several working group members (including the authors of this blog post) are working in this direction. The ultimate result may be a Linguistic Linked (Open) Data cloud, as sketched in the graphic to the right. On this basis, novel applications in all participating fields may be developed.

Get involved

Having said all that, we hope to have encouraged others to contribute and to join. If we have indeed succeeded in doing so, you may be interested in how to join and how to contribute:
How to join How to contribute
  • Register your (open) resources at CKAN (and please, don’t forget to tag them as “linguistics”)
  • Attend meetings / telcos (announced over the mailing list)
  • Write blog posts for our blog
  • Become a group administrator on CKAN (on request)