WIT Press

Using entity identification and classification for automated integration of spatial-temporal data

Price

Free (open access)

Volume

Volume 11 (2016), Issue 3

Pages

11

Page Range

186 - 197

Paper DOI

10.2495/DNE-V11-N3-186-197

Copyright

WIT Press

Author(s)

R. AHSAN, R. NEAMTU & E. RUNDENSTEINER

Abstract

Big data, crucial to answering the economic, social, and political questions facing our society, tend to be diverse and distributed across many sites on the Internet. The creation of tools to integrate and analyze such data is of paramount interest, yet automating these processes remains a great challenge. Our work rests on the observation that many public data sources, in domains ranging from economic to demographic, share key similarities despite their complex structure, namely the presence of Time and Location. Our proposed Data Integration through Object Modeling (DIOM) framework tackles the critical problem of automating data integration from a variety of public websites by abstracting key features of multi-dimensional tables and interpreting them in the context of a knowledge-centered Unified Spatial Temporal Model. Our classification-driven extractors are trained to identify and classify entities from both the structured and unstructured parts of spreadsheets. The unstructured part, contained in titles, headers, and footers, reveals so-called Implicit Knowledge that is crucial to the correct interpretation of the data. Our experimental results on real-world datasets from heterogeneous public data sources show a 25% increase in accuracy compared to state-of-the-art approaches.
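To make the idea of identifying Time and Location entities concrete, the following is a minimal illustrative sketch and not the DIOM implementation. It swaps the trained, classification-driven extractors described above for simple hand-written rules. The names `KNOWN_LOCATIONS`, `YEAR_PATTERN`, `classify_cell`, and `implicit_knowledge` are hypothetical, as is the tiny gazetteer of place names.

```python
import re

# Hypothetical, tiny gazetteer; a real system would need a far larger resource.
KNOWN_LOCATIONS = {"massachusetts", "texas", "france", "boston"}

# Assumption: a time entity is a four-digit year between 1800 and 2099.
YEAR_PATTERN = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")


def classify_cell(text):
    """Label one spreadsheet header cell as TIME, LOCATION, or OTHER."""
    cleaned = text.strip().lower()
    if YEAR_PATTERN.fullmatch(cleaned):
        return "TIME"
    if cleaned in KNOWN_LOCATIONS:
        return "LOCATION"
    return "OTHER"


def implicit_knowledge(title):
    """Collect year and place mentions from free text such as a table title."""
    years = YEAR_PATTERN.findall(title)
    words = re.findall(r"[A-Za-z]+", title)
    places = [w for w in words if w.lower() in KNOWN_LOCATIONS]
    return {"time": years, "location": places}


if __name__ == "__main__":
    # Structured part: header cells of a table.
    for cell in ["2010", "Texas", "Rate"]:
        print(cell, classify_cell(cell))
    # Unstructured part: the table title.
    print(implicit_knowledge("Unemployment rate in Massachusetts, 2010"))
```

In this sketch the title supplies a year and a place that never appear in the table body, which is the kind of information the abstract calls Implicit Knowledge. Rule-based matching is used here only for brevity; it would miss multi-word place names and non-year date formats.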

Keywords

big data, data extraction, data integration, information retrieval