InCharts Data Crawler under the hood

A Swedish version of this blog post is available here.

The gun violence events presented on this website are the result of a largely automatic process. It involves collecting data, working out which pieces of information actually describe a shooting event, and organizing and combining data from multiple sources, both to validate the information and to collect as many details as possible.

Sources and validation

Ideally, a data source would provide information that is

  • Instant
  • Totally reliable
  • Complete (all relevant details included)
  • In a well-structured format that is easy to parse programmatically

Unfortunately, reality isn't that simple.

Classification of sources

One way to classify a source of information is by reliability. At the lower end of the reliability scale are unverified gossip and anonymous comments on social media, while official information published by the police authority sits at the opposite end.

Another way to classify an information source is by how well structured its data is. This is essential for a website like this one, which relies on an automatic process to collect and analyze data. Newspaper articles are often unstructured, since they are written in natural language. Information about the same event could be phrased in countless ways, depending on the author's writing style.

The Swedish police authority provides information about events through an API, which makes the structure of the information both well-defined and documented.

One might think that a source that is both highly reliable and well-structured would be the ideal information source. There are, however, further aspects to consider, such as speed and completeness. Official sources don't always include every event that occurs, and they are often slower to publish than, for example, newspapers.
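The classification above can be sketched as a small data model. This is only an illustration, not how InCharts is implemented: the `Reliability` scale, the `Source` fields, and the delay values are all assumptions made up for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Reliability(IntEnum):
    """Hypothetical reliability scale; a higher value means more trustworthy."""
    GOSSIP = 1
    NEWSPAPER = 2
    OFFICIAL = 3


@dataclass(frozen=True)
class Source:
    name: str
    reliability: Reliability
    structured: bool            # True if the data can be parsed without text analysis
    typical_delay_hours: float  # placeholder value, not a measurement


# Illustrative entries only; the delay figures are invented placeholders.
SOURCES = [
    Source("social media", Reliability.GOSSIP, structured=False, typical_delay_hours=0.1),
    Source("newspaper", Reliability.NEWSPAPER, structured=False, typical_delay_hours=1.0),
    Source("police API", Reliability.OFFICIAL, structured=True, typical_delay_hours=6.0),
]


def fastest(sources):
    """Return the source with the shortest assumed publication delay."""
    return min(sources, key=lambda s: s.typical_delay_hours)


def most_reliable(sources):
    """Return the source ranked highest on the reliability scale."""
    return max(sources, key=lambda s: s.reliability)
```

With these placeholder values, `fastest(SOURCES)` and `most_reliable(SOURCES)` return different sources, which is the trade-off described above: no single source wins on every dimension.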

The concept of collecting and validating data

The process by which information ends up on InCharts as a verified event, with a set of details regarding that event, can be divided into three steps:

  • Crawling
  • Validation
  • Manual Review

For the crawling part, the idea is to collect information from as many sources as possible, flagging the events we find that might be of interest. Reliability is not that important at this stage; incorrect information might be included and filtered out later. Setting the reliability requirement too high comes with the disadvantage that correct information might be excluded as well.
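As a rough sketch of this permissive flagging, the snippet below marks every item that mentions a keyword, regardless of where it came from. The keyword list, the item layout, and the function name `flag_candidates` are hypothetical; the actual crawler may work quite differently.

```python
# Hypothetical keyword list; a real crawler would likely use Swedish terms
# and something more robust than plain substring matching.
KEYWORDS = ("shooting", "shots fired", "gunfire")


def flag_candidates(items):
    """Return every item whose text mentions a keyword.

    No reliability filter is applied here on purpose: false positives are
    expected and are left for the validation step to remove.
    """
    flagged = []
    for item in items:
        text = item["text"].casefold()
        if any(keyword in text for keyword in KEYWORDS):
            flagged.append(item)
    return flagged


# Made-up example items, not real data.
items = [
    {"source": "social media", "text": "Heard shots fired near the square?"},
    {"source": "newspaper", "text": "Council approves new bike lanes"},
    {"source": "newspaper", "text": "Police investigate shooting overnight"},
]
candidates = flag_candidates(items)
```

Here the unverified social media post is flagged alongside the newspaper article. Dropping it at this point would risk losing a correct early report.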

At the automatic validation step, information from unreliable sources will be checked against information from more reliable sources.

Flowchart, data crawling, validation and review

The above figure illustrates how well-structured information that is not considered to be very reliable is successfully validated against a more reliable source with less structured information.

When we know almost nothing about what to look for, it is difficult to analyze an unstructured source and find information in it. It is much easier to check whether the source contains specific information that is already known. Here we take advantage of the structured information we already have from another source.

When unreliable, well-structured data is confirmed by data from a reliable, unstructured source, the unreliable data is upgraded and considered reliable. In turn, information can then be extracted more easily from the unstructured source, contributing more relevant details about the event.
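A minimal sketch of this idea, under simplifying assumptions: the structured report is reduced to two fields, and "confirmed" simply means that both values appear somewhere in the trusted text. The `Report` shape, the function names, and the place names are invented for illustration. A real validator would need to handle inflected forms, dates, and partial matches.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Report:
    """Structured report from a source that is not yet trusted (hypothetical shape)."""
    city: str
    street: str
    reliable: bool = False


def mentions_all(text: str, facts: list[str]) -> bool:
    """Return True if every known fact appears in the free text, ignoring case."""
    haystack = text.casefold()
    return all(fact.casefold() in haystack for fact in facts)


def validate(report: Report, reliable_text: str) -> Report:
    """Return the report upgraded to reliable if the trusted text confirms it."""
    if mentions_all(reliable_text, [report.city, report.street]):
        return replace(report, reliable=True)
    return report
```

For example, a report with the made-up values `Report("Malmö", "Storgatan")` would be upgraded by a trusted text that mentions both names, and returned unchanged otherwise. Searching for known values is far simpler than trying to discover them in free text without any prior knowledge.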

As soon as data is validated against a single other source, the analysis tools will create a monitored event. For a monitored event, as the name suggests, sources are continuously monitored so that additional information can be collected and added as soon as it is found.
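One way to picture a monitored event is as a record that is created after the first confirmation and then keeps absorbing new details. The class and function names below are hypothetical, and the merge policy (the first value seen for a field is kept) is an assumption made for this sketch, not a description of the InCharts tools.

```python
from dataclasses import dataclass, field


@dataclass
class MonitoredEvent:
    event_id: str
    details: dict[str, str] = field(default_factory=dict)
    confirming_sources: set[str] = field(default_factory=set)

    def add_details(self, source: str, new_details: dict[str, str]) -> list[str]:
        """Merge details from a source and return the keys that were newly added.

        Assumed policy: existing values are kept, so only unknown keys are added.
        """
        self.confirming_sources.add(source)
        added = [key for key in new_details if key not in self.details]
        for key in added:
            self.details[key] = new_details[key]
        return added


def maybe_create_event(event_id: str, details: dict[str, str], confirmations: set[str]):
    """Create a monitored event once at least one other source confirms the data."""
    if len(confirmations) >= 1:
        return MonitoredEvent(event_id, dict(details), set(confirmations))
    return None
```

Each later crawl of a monitored source would then call `add_details`, so the event grows more complete over time without overwriting what has already been collected.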

The last step in the list above is the manual review. It is outside the scope of this article, but it is nevertheless very important. Before any information is made public, the system notifies a human being about the information that has been collected and compiled. However intelligent the automation is or becomes, incorrect information will sometimes slip through the system. The manual review provides an opportunity to correct this, as part of making every effort to ensure that what is published on InCharts is accurate.