Automatic, Format-independent Generation of Metadata for Documents Based on Semantically Enriched Context Information

THOENSSEN, BARBARA

The purpose of this study was to investigate how metadata can be generated automatically for all types of documents used in an enterprise, regardless of their content. Because of the increasing number of non-textual documents, i.e. images, audio and video files, full-text indexing is not applicable, and the use of metadata has therefore become more and more important for resource description and discovery. However, creating metadata manually is time-consuming, error-prone and barely feasible for the huge number of documents an enterprise deals with daily. Thus, an approach for automatic, format-independent metadata generation is required. To begin, the document's context was analysed. A document is considered an enterprise object that is related to other enterprise objects, such as the task the document is used in and the purpose it is created for. It was recognised that the context of a document can be described formally and semantically enriched in an enterprise architecture. This enterprise architecture description can then be used for automatic metadata generation. To use the enterprise architecture description in a productive environment, it was determined how its objects can be linked to enterprise components, e.g. information stored in a relational database. Finally, a procedure for setting up, conducting and utilizing the metadata generation approach in an enterprise was identified. The combination of these objectives has been called mintApproach. The mintApproach addresses the huge annual economic loss caused by the time wasted on information retrieval. The research design followed the deductive approach and employed a mixed-method strategy combining four methods: a representative study, qualitative interviews, Action Research and prototyping. The results of the Representative Study provided a comprehensive source for the analysis of the use of document creation tools in enterprises and of preferred search strategies.
Qualitative interviews, conducted in a survey and based on a structured questionnaire, provided insights into document handling in enterprises. Action Research and prototyping were applied in two different types of organisations: a non-profit organisation (NPO) in the domain of sexual health and a small or medium-sized enterprise (SME) developing contract management software. Evolutionary prototyping formed an integral part of the Action Research studies and led to the development of an executable prototype. Applying Action Research in two enterprises with very different businesses and business goals helped to avoid the common pitfalls of this method, such as subjectivity and a lack of generality and replicability. The results of the survey and the Action Research studies confirmed that, for document management in enterprises and public administrations alike, a document's context is taken into account. Although relations between documents and other enterprise objects may be hidden, low-level governance instruments such as guidelines for file storage help to reveal these relations. For example, relations to other enterprise objects such as a product or a client are implicit in the file structure in which a document is stored. Naming conventions for files are another way of implicitly stating relations between enterprise objects and documents. Once made explicit, this information is represented in a semantically enriched Enterprise Architecture description. It was found that ArchiMate, the well-known standard for Enterprise Architecture modelling, was well suited to provide the basis for a core enterprise ontology. ArchiMate was refined, enhanced and represented in RDFS-Plus, an ontology language that is machine-executable but also cognitively adequate for humans. This core ontology was extended with application-specific ontologies reflecting enterprise-specific needs, for example for representing domain knowledge or improving contract lifecycle management.
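The way file-storage guidelines and naming conventions implicitly encode relations can be illustrated with a minimal Python sketch. It is not taken from the study: the folder layout (`<client>/<product>/<file>`), the naming convention (`<date>_<doctype>_v<version>`) and all predicate names such as `relatesToClient` are hypothetical assumptions chosen for illustration only. The sketch never opens the file, so it applies equally to text, image, audio and video documents.

```python
import re
from pathlib import PurePosixPath
from typing import List, Tuple

# Hypothetical naming convention: <YYYY-MM-DD>_<doctype>_v<version>
NAME_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})_(?P<doctype>[a-z]+)_v(?P<version>\d+)$"
)

Triple = Tuple[str, str, str]


def relations_from_path(path: str) -> List[Triple]:
    """Return (subject, predicate, object) triples implied by a file path.

    Assumes the hypothetical storage guideline <client>/<product>/<file>.
    Only the path is inspected; the file content is never read.
    """
    p = PurePosixPath(path)
    triples: List[Triple] = []

    # Relations implicit in the folder structure.
    if len(p.parts) >= 3:
        client, product = p.parts[-3], p.parts[-2]
        triples.append((path, "relatesToClient", client))
        triples.append((path, "relatesToProduct", product))

    # Relations implicit in the file name (extension excluded).
    match = NAME_PATTERN.match(p.stem)
    if match:
        triples.append((path, "createdOn", match["date"]))
        triples.append((path, "hasDocumentType", match["doctype"]))
        triples.append((path, "hasVersion", match["version"]))

    return triples
```

For a path such as `acme/contractmanager/2012-03-01_offer_v2.mp3`, the function yields five triples, which could then be stored as statements in an ontology language such as RDFS-Plus.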
The enterprise ontology was considered part of an enterprise repository comprising all enterprise objects that constitute an organisation, regardless of their representation. Thus, for automatic, format-independent metadata generation based on context, ontology-to-database mapping was considered suitable. The approach was evaluated based on an executable prototype that illustrates the scientific models and makes it easier for the evaluators to assess the underlying scientific concepts. The goal of the evaluation was to determine the appropriateness, capability and applicability of the mintApproach. The mintApproach, visualized in the MeGaWorkbench prototype, was assessed as appropriate for automatic, format-independent metadata generation for business documents. Using context for metadata generation was considered promising, particularly for multimedia documents and for documents with few, meaningless or even wrong document attributes. The mintApproach was considered beneficial as it helps to meet business needs in handling the ever-increasing amount of unstructured information by reducing the amount of personnel time involved.
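The idea of ontology-to-database mapping for context-based metadata generation can be sketched as follows. This is an illustrative assumption, not the mapping technique used in the MeGaWorkbench prototype: the `MAPPING` table, the `client` relation and the property names `clientName` and `clientCountry` are all hypothetical, and an in-memory SQLite database stands in for an enterprise's relational database.

```python
import sqlite3
from typing import Dict

# Hypothetical mapping: ontology property -> (table, column).
# Table and column names come from this trusted constant, never from user input.
MAPPING = {
    "clientName": ("client", "name"),
    "clientCountry": ("client", "country"),
}


def demo_connection() -> sqlite3.Connection:
    """Create an in-memory database standing in for an enterprise database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE client (id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
    )
    conn.execute("INSERT INTO client VALUES (1, 'Acme', 'CH')")
    return conn


def generate_metadata(conn: sqlite3.Connection, client_id: int) -> Dict[str, str]:
    """Collect metadata for a document from the client it is related to."""
    metadata: Dict[str, str] = {}
    for prop, (table, column) in MAPPING.items():
        row = conn.execute(
            f"SELECT {column} FROM {table} WHERE id = ?", (client_id,)
        ).fetchone()
        if row is not None:
            metadata[prop] = row[0]
    return metadata
```

With the demo data, `generate_metadata(demo_connection(), 1)` returns the client's name and country as metadata for any document related to that client, whatever the document's format. An unknown client identifier yields an empty dictionary.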