End-to-end tutorials
Building a landing page
Extracting and plotting shapes and data from PDFs
Monitoring social media collections, extracting and plotting recent events
Simulating source and sink dynamics
Building a dynamic, multi-layer interactive atlas
Automatically publishing a weekly summary newsletter
A simple two-player game
Extracting Metadata and Geospatial Data from a PDF
In this tutorial we'll show you how to use structured AI workflows and the document upload feature to reliably extract structured data from a PDF and plot the resulting polygons on the Space map.
You will use a small amount of code alongside Horizon to orchestrate a repeatable pipeline for extracting structured data from PDFs, using AI to extract the text.
In essence, Horizon spaces serve here as a visual database, and workflows track the status of your extraction pipeline.
The geospatial extraction pipeline in action
Step 1: Identify your data source
In this tutorial we use the Mining Bulletin of Chile. Every day, mining concessions (deeds that authorize an entity to explore land for potential mining) are posted to the bulletin.
Step 2: Use a single AI workflow on a PDF to derive an appropriate JSON schema
You'll want to manually trigger a PDF -> OCR -> AI workflow at first to extract an appropriate metadata schema that you can reuse for subsequent documents. This schema ensures that metadata lands in the correct fields every time, keeping your data normalized.
Once you've done this, you can manually enter the JSON schema in the Horizon UI as a specific table. (In the future, Horizon will allow you to upload a JSON schema directly.)
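As a concrete illustration, a metadata schema for a mining-concession bulletin entry might look like the sketch below. The field names and structure are assumptions for illustration only, not Horizon's actual schema format.

```python
import json

# Hypothetical metadata schema for a mining-concession bulletin entry.
# Field names and types are illustrative assumptions, not Horizon's format.
concession_schema = {
    "title": "mining_concession",
    "type": "object",
    "properties": {
        "applicant": {"type": "string"},
        "concession_name": {"type": "string"},
        "commune": {"type": "string"},
        "area_hectares": {"type": "number"},
        "publication_date": {"type": "string", "format": "date"},
    },
    "required": ["applicant", "concession_name", "publication_date"],
}

print(json.dumps(concession_schema, indent=2))
```

Because every document is extracted against the same schema, records from different bulletins can be written to one table without per-document cleanup.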
Take note of the layer ID.
Step 3: Create your repeatable data-extraction workflow
You will need to upload a PDF and make it available as a document. On the 'workflows' page, create a 'trigger' node that launches the workflow any time a PDF is uploaded to a specific space.
Add a 'resource' node and assign it to the PDFs of the space you chose.
Then connect it to an 'AI transformation' node. Three AI transformations run in parallel.
The first extracts the paragraph's geospatial information in the format accepted by Horizon (GeoJSON coordinates, SRID 4326).
The second identifies a metadata structure that can serve as a schema for future extractions. You can pass a JSON format here.
The third passes that metadata structure as the JSON format to an AI transformation and extracts information according to the schema.
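For reference, geospatial output in the accepted format (GeoJSON, SRID 4326, i.e. longitude/latitude in WGS 84) looks like the following. The coordinates are invented for illustration; a real concession boundary would come from the extraction step.

```python
import json

# Example GeoJSON polygon in SRID 4326 (WGS 84); each vertex is [longitude, latitude].
# The coordinates below are invented, not a real concession boundary.
concession_polygon = {
    "type": "Polygon",
    "coordinates": [[
        [-70.35, -27.40],
        [-70.30, -27.40],
        [-70.30, -27.35],
        [-70.35, -27.35],
        [-70.35, -27.40],  # a GeoJSON linear ring must close on its first vertex
    ]],
}

print(json.dumps(concession_polygon))
```

Checking that each ring closes on itself before dispatching is a cheap way to catch malformed AI output early.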
Then, use a 'dispatch' node to write a combined data point to the table.
Finally, connect another 'dispatch' node to the output of the first dispatch node and to the GeoJSON extraction in order to set the geospatial information.
Note: this can be built as a workflow, which users can modify and share with others, or written as direct API calls if you expect to hardcode the space ID and layer ID. It can also be packaged as a plugin with mappers.
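As a rough sketch of the direct-API variant, the dispatch call might be constructed as below. The base URL, endpoint path, payload fields, and IDs are all assumptions for illustration; Horizon's real API may differ.

```python
import json
import urllib.request

# Placeholder endpoint and hardcoded IDs, as the note above suggests.
# These values are illustrative assumptions, not real Horizon endpoints.
HORIZON_BASE = "https://api.horizon.example"
SPACE_ID = "space-123"
LAYER_ID = "layer-456"

def build_dispatch_request(metadata: dict, geometry: dict) -> urllib.request.Request:
    """Build (but do not send) a request writing one combined data point."""
    payload = {
        "space_id": SPACE_ID,
        "layer_id": LAYER_ID,
        "fields": metadata,          # values matching the JSON schema's table
        "geometry": geometry,        # GeoJSON, SRID 4326
    }
    return urllib.request.Request(
        url=f"{HORIZON_BASE}/spaces/{SPACE_ID}/layers/{LAYER_ID}/points",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_dispatch_request(
    metadata={"applicant": "Minera Ejemplo SpA"},
    geometry={"type": "Polygon",
              "coordinates": [[[-70.3, -27.4], [-70.2, -27.4],
                               [-70.2, -27.3], [-70.3, -27.4]]]},
)
print(req.get_method(), req.full_url)
```

Separating payload construction from sending keeps the hardcoded-ID variant easy to test, and easy to swap back to a shareable workflow later.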
Note that this extraction pipeline can be generalized.
Step 4 (optional): Expose your table as an API and place it on the marketplace
After a few thousand documents, AI-extracted data can be valuable, and extraction can get a little pricey. Contact Horizon to learn how to monetize and control API access to the data you've collected.
horizon.tech
A low-code platform for data collection, schema management, data visualization, and publishing.
© 2024 - 2025 Yrbia LLC, Delaware. All rights reserved.