Automated conflation of multi-source points of interest (POIs) for creation of best-of-breed datasets
Student name: Ms Poonam Chandel
Guide: Dr Vinay Shankar Prasad Sinha
Year of completion: 2016
Host Organisation: Pitney Bowes Software India Pvt. Ltd.
Supervisor (Host Organisation): Dr Neena Priyanka
Abstract: This study attempts to create a best-of-breed dataset by automatically conflating
POIs for the USA and the UK in categories such as schools, hospitals, restaurants, banks,
hotels and post offices, extracted from two different sources (Yellow Pages and Bing Maps).
Conflation combines the best-quality elements of both datasets to create a composite dataset
that is a better representation of each POI. The entire study is carried out in two stages for both
countries. In the first stage, the data is scraped from the Yellow Pages and Bing Maps websites
using trial versions of Yellabot and Local scraping software. A flow is then created in the
Spectrum Technology Platform that defines all the attributes on which the data is to be
conflated, along with matching algorithms such as Soundex, Metaphone, Metaphone 3
and Double Metaphone. To test the accuracy of the process, false positive (wrongly
conflated) and false negative (not conflated) cases are identified. In the USA dataset, there
are no false positive cases in any category at either stage, whereas false negative
cases in the first stage were highest in the banks category (0.59%), followed by schools (0.27%)
and post offices (0.27%), and almost negligible in the remaining categories. In the second stage,
to further reduce these false negative cases, the data from both sources is standardized to
remove noise and maintain uniformity within the dataset, and the matching algorithm in the
Spectrum platform is replaced by edit distance, which increases the number of
conflated results and reduces the number of false negative cases by almost 50%.
Keywords: POI, Conflation, Standardization, False Negative, Spectrum Technology
Platform.
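
The second-stage idea described in the abstract (standardize both sources, then compare records by edit distance) can be illustrated with a minimal Python sketch. This is not the Spectrum Technology Platform flow used in the study. The abbreviation map, the normalized similarity score, the 0.85 threshold and all function names are assumptions chosen only for illustration.

```python
import re

# Hypothetical abbreviation map; the study's actual standardization rules are not stated.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}


def standardize(text):
    """Lowercase, drop punctuation, expand a few abbreviations, collapse whitespace."""
    text = text.lower().replace("&", " and ")
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Edit distance scaled to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def is_same_poi(name_a, name_b, threshold=0.85):
    """Treat two POI names as one entity if they are similar enough (assumed threshold)."""
    return similarity(standardize(name_a), standardize(name_b)) >= threshold
```

For example, `is_same_poi("Barclays Bank, High St.", "barclays bank high street")` returns `True` because both names standardize to the same string. Without the standardization step, the same pair would differ in case, punctuation and abbreviation, which is the kind of false negative the second stage aims to reduce.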