
【Cofacts 真的假的】Open Datasets

The Cofacts datasets include instant messages reported by Cofacts chatbot users and the replies written by the Cofacts crowd-sourced fact-checking community.

Access the datasets

Cofacts Working Group distributes the dataset using Google Drive.

📥 Please fill in this form to access the dataset.

Accessing the Cofacts data means that you agree to the Data User Agreement described in LEGAL.md. In general, everyone can freely share and adapt the dataset as long as they follow the terms and conditions described in CC BY-SA 4.0 and in LEGAL.md.

In general, when you redistribute Cofacts data outside of the LINE application, the attribution specified by Cofacts Working Group is:

This data by Cofacts message reporting chatbot and crowd-sourced fact-checking community is licensed under CC BY-SA 4.0. To provide more info, please visit Cofacts LINE bot https://line.me/ti/p/@cofacts

Unless otherwise agreed, the Chinese-language attribution specified by Cofacts Working Group for Cofacts data redistributed outside of LINE is:

本編輯資料取自「Cofacts 真的假的」訊息回報機器人與查證協作社群,採 CC BY-SA 4.0 授權提供。若欲補充資訊請訪問 Cofacts LINE bot https://line.me/ti/p/@cofacts

Please see LEGAL.md for more detail.

Terms

LEGAL.md is the user agreement for Cofacts data users who use the Cofacts data described here or provided via the API.

LICENSE defines the license agreement for the source code in this repository.

Formats

All CSV files are UTF-8 encoded and compressed in a zip file.

We use csv-stringify to escape values and handle quotes.
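
As a minimal sketch of reading such a file, the snippet below uses only the Python standard library. The archive layout and the member name articles.csv are assumptions made for illustration, and the sample archive is built in memory so the example runs without a real download. Quoted fields may contain commas, doubled quotes and line breaks, so a CSV parser is safer than splitting lines by hand.

```python
import csv
import io
import zipfile


def read_csv_from_zip(zip_bytes, member_name):
    """Return the rows of one CSV member of a zip archive as a list of dicts."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member_name) as raw:
            # newline="" lets the csv module handle line breaks inside quoted fields.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


def build_sample_zip():
    """Build a tiny in-memory archive that stands in for a real download (hypothetical data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(
            "articles.csv",
            'id,text\r\na1,"第一行\nsecond line, with ""quotes"""\r\n'.encode("utf-8"),
        )
    return buffer.getvalue()


rows = read_csv_from_zip(build_sample_zip(), "articles.csv")
print(rows[0]["text"])
```

For a real download, pass the bytes of the zip file and the name of the CSV it actually contains.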

Fields across different entities

  • userIdsha (string) Hashed user identifier.
  • appId (string) Possible values:
    • LEGACY_APP: Articles collected before 2017-03.
    • RUMORS_LINE_BOT: Articles collected with the current LINE bot client after 2017-03.

Together, the two fields identify a unique user across different CSV files. For instance, if one row (reply) in replies.csv and another row (feedback) in article_reply_feedbacks.csv have identical userIdsha and appId, the reply and the feedback were submitted by the same user.
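
The matching rule above can be sketched as a composite key. The sample rows below are hypothetical and contain only the columns needed for the illustration; real files have more columns.

```python
import csv
import io

# Hypothetical sample rows standing in for replies.csv and article_reply_feedbacks.csv.
REPLIES_CSV = """id,userIdsha,appId
r1,abc,RUMORS_LINE_BOT
r2,abc,LEGACY_APP
"""

FEEDBACKS_CSV = """articleId,replyId,userIdsha,appId,score
a1,r9,abc,RUMORS_LINE_BOT,1
a2,r9,def,RUMORS_LINE_BOT,-1
"""


def user_key(row):
    """A user is identified by the pair (userIdsha, appId), not by either field alone."""
    return (row["userIdsha"], row["appId"])


def users_in(csv_text):
    """Collect the set of user keys that appear in a CSV document."""
    return {user_key(row) for row in csv.DictReader(io.StringIO(csv_text))}


# Users who both wrote a reply and submitted a feedback.
both = users_in(REPLIES_CSV) & users_in(FEEDBACKS_CSV)
print(both)
```

Note that the same userIdsha under a different appId is treated as a different user, which is why the second reply row does not match.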

Fields

articles.csv

The instant messages LINE bot users submitted into the database.

| Field | Data type | Description |
| --- | --- | --- |
| id | String | |
| references | Enum string | Where the message is from. Currently the only possible value is LINE. |
| userIdsha | String | Author of the article. |
| appId | String | |
| normalArticleReplyCount | Integer | The number of replies associated with this article, excluding deleted reply associations. |
| text | Text | The instant message text. |
| createdAt | ISO time string | When the article was submitted to the database. |
| updatedAt | ISO time string | Reserved; currently identical to createdAt. |
| lastRequestedAt | ISO time string | The submission time of the last reply_request sent on the article before the article was replied to. |

article_hyperlinks.csv

Parsed hyperlink contents in each instant message, parsed using cofacts/url-resolver. The data is used in the Cofacts system for indexing and retrieving messages.

| Field | Data type | Description |
| --- | --- | --- |
| articleId | String | |
| url | String | The URL string detected in the article. |
| normalizedUrl | String | Canonical URL after the normalization process, including unfolding shortened URLs. |
| title | String | Title of the scraped web content. |

Note: Scraped contents do not belong to Cofacts and are redistributed for research purposes. The scraping mechanism is not reliable either. Researchers may need to implement their own scraper if the content is important to their research.

article_categories.csv

Categories linked to this article.

| Field | Data type | Description |
| --- | --- | --- |
| articleId | String | |
| categoryId | String | |
| aiConfidence | Number | Confidence level of the AI marking this category. Empty for crowd-sourced labels. |
| aiModel | String | Name of the AI model marking this category. Empty for crowd-sourced labels. |
| userIdsha | String | The person who connected the article and the category. |
| appId | String | |
| negativeFeedbackCount | Integer | Number of article_category_feedbacks that have a score of -1. |
| positiveFeedbackCount | Integer | Number of article_category_feedbacks that have a score of 1. |
| status | Enum string | NORMAL: The category and article are connected. DELETED: The category is no longer connected to the article. |
| createdAt | ISO time string | The time when the category was connected to the article. |
| updatedAt | ISO time string | The latest date when the category's status was updated. |
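
As a rough illustration of how the aiModel and status fields can be used together, the sketch below separates AI-marked labels from crowd-sourced ones and skips deleted links. The sample rows and the model name model-x are hypothetical.

```python
import csv
import io

# Hypothetical rows with only the columns used here.
SAMPLE = """articleId,categoryId,aiConfidence,aiModel,status
a1,c1,0.93,model-x,NORMAL
a1,c2,,,NORMAL
a2,c1,,,DELETED
"""


def active_labels(csv_text):
    """Split NORMAL article-category links into AI-marked and crowd-sourced lists."""
    ai, crowd = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["status"] != "NORMAL":
            continue  # skip links that are no longer connected
        pair = (row["articleId"], row["categoryId"])
        if row["aiModel"]:
            # AI-marked labels carry a confidence value.
            ai.append((*pair, float(row["aiConfidence"])))
        else:
            # An empty aiModel means the label is crowd-sourced.
            crowd.append(pair)
    return ai, crowd


ai_labels, crowd_labels = active_labels(SAMPLE)
print(ai_labels, crowd_labels)
```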

categories.csv

| Field | Data type | Description |
| --- | --- | --- |
| id | String | |
| title | String | Name of the category |
| description | Text | Definition of the category |
| createdAt | ISO time string | |
| updatedAt | ISO time string | |

article_replies.csv

Articles and replies are in a has-and-belongs-to-many relationship. That is, an article can have multiple replies, and a reply can be connected to multiple similar articles.

article_replies is the "join table" between articles and replies, bringing articleId and replyId together, along with other useful properties related to this connection between an article and a reply.

One pair of articleId, replyId will map to exactly one article_reply.

| Field | Data type | Description |
| --- | --- | --- |
| articleId | String | Relates to the id field of articles. |
| replyId | String | Relates to the id field of replies. |
| userId | String | The user who connected the reply with the article. |
| negativeFeedbackCount | Integer | Number of article_reply_feedbacks that have a score of -1. |
| positiveFeedbackCount | Integer | Number of article_reply_feedbacks that have a score of 1. |
| replyType | Enum string | Duplicated from the type field of replies. |
| appId | String | |
| status | Enum string | NORMAL: The reply and article are connected. DELETED: The reply is no longer connected to the article. |
| createdAt | ISO time string | The time when the reply was connected to the article. |
| updatedAt | ISO time string | The latest date when the reply's status was updated. |
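
The join-table relationship can be sketched as follows. The rows are hypothetical and assumed to be already parsed from article_replies.csv and replies.csv; only links with status NORMAL are followed.

```python
from collections import defaultdict

# Hypothetical rows, already parsed from article_replies.csv and replies.csv.
article_replies = [
    {"articleId": "a1", "replyId": "r1", "status": "NORMAL"},
    {"articleId": "a2", "replyId": "r1", "status": "NORMAL"},
    {"articleId": "a1", "replyId": "r2", "status": "DELETED"},
]
replies = {"r1": {"type": "RUMOR"}, "r2": {"type": "NOT_RUMOR"}}


def replies_by_article(links, replies_by_id):
    """Map each articleId to the replies currently connected to it."""
    result = defaultdict(list)
    for link in links:
        if link["status"] == "NORMAL":
            result[link["articleId"]].append(replies_by_id[link["replyId"]])
    return dict(result)


joined = replies_by_article(article_replies, replies)
print(joined)
```

Here reply r1 is attached to two articles, while the deleted link between a1 and r2 is ignored.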

replies.csv

Editor's reply to the article.

| Field | Data type | Description |
| --- | --- | --- |
| id | String | |
| type | Enum string | Type of the reply chosen by the editor. RUMOR: The article contains rumor. NOT_RUMOR: The article contains fact. OPINIONATED: The article contains personal opinions. NOT_ARTICLE: The article should not be processed by Cofacts. |
| reference | Text | For RUMOR and NOT_RUMOR replies: The reference to support the chosen type and text. For OPINIONATED replies: References containing different perspectives from the article. For NOT_ARTICLE: empty string. |
| userId | String | The editor who authored this reply. |
| appId | String | |
| text | Text | Reply text written by the editor. |
| createdAt | ISO time string | When the reply was written. |

reply_hyperlinks.csv

Parsed hyperlink contents in reply text and references, parsed using cofacts/url-resolver. The data is used in Cofacts system for URL previews.

| Field | Data type | Description |
| --- | --- | --- |
| replyId | String | |
| url | String | The URL string detected in the reply. |
| normalizedUrl | String | Canonical URL after the normalization process, including unfolding shortened URLs. |
| title | String | Title of the scraped web content. |

Note: Scraped contents do not belong to Cofacts and are redistributed for research purposes. The scraping mechanism is not reliable either. Researchers may need to implement their own scraper if the content is important to their research.

reply_requests.csv

Before an article is replied to, users may submit reply_requests to indicate that they want this article to be answered.

When an article is first submitted to the database, a reply request is also created. Any further queries to the same article submit new reply_requests.

A user can only submit one reply request to an article.

| Field | Data type | Description |
| --- | --- | --- |
| articleId | String | The target of the request. |
| reason | Text | The reason why the user wants to submit this reply request. |
| positiveFeedbackCount | Integer | Number of editors who think the reason is reasonable. |
| negativeFeedbackCount | Integer | Number of editors who think the reason is nonsense. |
| createdAt | ISO time string | When the reply request was issued. |

article_reply_feedbacks.csv

Editors and LINE bot users can express whether a reply is useful by submitting article_reply_feedbacks toward an article_reply with score 1 or -1.

The feedback is actually submitted toward an article_reply, the connection between an article and a reply. This is because a reply can be connected to multiple articles. A reply that makes sense for one article is not necessarily useful in answering another article. Therefore, the feedback for a reply connected to different articles is counted separately for each article.

| Field | Data type | Description |
| --- | --- | --- |
| articleId | String | Relates to the articleId of the target article_reply. |
| replyId | String | Relates to the replyId of the target article_reply. |
| score | Integer | 1: Useful. -1: Not useful. |
| comment | Text | Why the user chose this score for this article reply. |
| createdAt | ISO time string | When the feedback was submitted. |
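
Because feedback belongs to an (articleId, replyId) pair rather than to a reply alone, a tally should be keyed by that pair. The following is a minimal sketch over hypothetical rows, assumed to be already parsed from article_reply_feedbacks.csv, where every value arrives as a string.

```python
from collections import Counter

# Hypothetical rows; CSV values are strings, so the score needs converting.
feedbacks = [
    {"articleId": "a1", "replyId": "r1", "score": "1"},
    {"articleId": "a1", "replyId": "r1", "score": "-1"},
    {"articleId": "a2", "replyId": "r1", "score": "1"},
]


def tally(feedback_rows):
    """Count useful and not-useful feedback separately for each article_reply."""
    positive, negative = Counter(), Counter()
    for row in feedback_rows:
        key = (row["articleId"], row["replyId"])
        score = int(row["score"])
        if score == 1:
            positive[key] += 1
        elif score == -1:
            negative[key] += 1
    return positive, negative


positive, negative = tally(feedbacks)
print(positive, negative)
```

The same reply r1 ends up with different counts for articles a1 and a2.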

analytics.csv

Usage (visit / show) statistics of website and Cofacts LINE bot.

LINE bot data starts from April 2nd, 2018; website data starts from May 3rd, 2017.

| Field | Data type | Description |
| --- | --- | --- |
| type | Enum string | Either article or reply. |
| docId | String | Article ID or reply ID that is being visited / shown. |
| date | ISO time string | The date of usage, represented by the start of the day (0:00:00+08:00). |
| lineUser | Integer | The number of LINE users who inspected this article / reply in Cofacts LINE bot on this date. May be empty if there are no such users. |
| lineVisit | Integer | The number of times this article / reply was inspected in Cofacts LINE bot on this date. May be empty if there are no visits. |
| webUser | Integer | The number of web users who visited this article page (/article/<docId>) / reply page (/reply/<docId>) on the Cofacts website on this date. May be empty if there are no such users. |
| webVisit | Integer | The number of page views of this article page (/article/<docId>) / reply page (/reply/<docId>) on the Cofacts website on this date. May be empty if there are no page views. |
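
Since the count columns may be empty, a parser should treat an empty string as zero rather than fail on it. The sketch below assumes the date is written like 2018-04-02T00:00:00+08:00; the exact serialization in the real file may differ, so check it before relying on this.

```python
from datetime import datetime

COUNT_FIELDS = ("lineUser", "lineVisit", "webUser", "webVisit")


def parse_analytics_row(row):
    """Convert one analytics.csv row (all strings) into typed values."""
    result = {"type": row["type"], "docId": row["docId"]}
    # Assumed date format; fromisoformat keeps the +08:00 offset.
    result["date"] = datetime.fromisoformat(row["date"])
    for field in COUNT_FIELDS:
        # An empty cell means no users or visits were recorded.
        result[field] = int(row[field]) if row[field] else 0
    return result


# Hypothetical sample row.
sample = {
    "type": "article",
    "docId": "a1",
    "date": "2018-04-02T00:00:00+08:00",
    "lineUser": "3",
    "lineVisit": "5",
    "webUser": "",
    "webVisit": "",
}
parsed = parse_analytics_row(sample)
print(parsed)
```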

⚠ [NOTICE] Caveats of using this data ⚠

The methodology we use to collect this data (i.e. how Cofacts works) can affect the credibility of the data.

[Figure: How Cofacts works]

Please keep in mind that all data in this dataset is user-generated and is thus not free from noise and sampling bias coming from these sources:

  • The distribution of Cofacts' users may not reflect the real distribution of all LINE users in Taiwan.
  • Users may not use Cofacts in the way we intend. Some articles may not be actual messages circulating in the LINE network.
  • Replies may contain factual errors. All replies should be merely regarded as "responses to the original message (article) to provide different point of view". They are neither the "truth" nor the editor's personal opinion.
  • There may also be malicious users sending garbage articles into the database. (Previous incident report)
  • The programs that collect the data and generate the dataset may contain errors. The dataset may be systematically inaccurate as a result.

Lastly, the dataset is provided without warranty.

THE DATASET IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATASET OR THE USE OR OTHER DEALINGS IN THE DATASET.

Generating opendata files

We generate the opendata files by backing up the production DB to a local machine, then running this script on the local machine.

According to rumors-deploy, the production DB raw data should be available in rumors-deploy/volumes/db-production. (Staging is in db-staging instead.)

To back up the production DB, tar the rumors-deploy/volumes/db-production directory, download it to the local machine, extract the tar file and put it in the esdata directory in this project's root. esdata should now contain only the nodes directory.

Run this to spin up a local Elasticsearch for the backed-up files:

$ docker-compose up

This spins up Elasticsearch on localhost:62223, with Kibana available at localhost:62224, using the data in esdata.

Lastly, run this to generate files in the data/ directory:

$ npm start

Restore production backup from Cofacts' Google Cloud Storage bucket

For the Cofacts production website, the nodes directory is too large to back up using simple zip files. Instead, we use Elasticsearch snapshots and the Google Cloud Storage Repository plugin to perform backup and restore regularly.

Below are the steps for setting up the GCS repository and reading backups from Google Cloud Storage.

First-time setup

First, spin up local Elasticsearch and Kibana using docker-compose up.

Second, ask a team member for the service account credential gcs.json. Put the file under esdata/.

Open another terminal and execute:

# Install gcs plugin
$ docker-compose exec elasticsearch bin/elasticsearch-plugin install repository-gcs
# Enter "y" when asked to continue

# Install service account credential
$ docker-compose exec elasticsearch bin/elasticsearch-keystore add-file gcs.client.default.credentials_file data/gcs.json

# Restart
$ docker-compose restart elasticsearch

After Elasticsearch turns green, go to Kibana and execute the following commands:

# Run in Kibana

# Initialize snapshot repository named "cofacts" as GCS repository.
# Since we only read from the repository, turn on "readonly" flag.
#
PUT _snapshot/cofacts
{
  "type": "gcs",
  "settings": {
    "bucket": "rumors-db",
    "readonly": true
  }
}

Loading snapshot from GCS

Before publishing opendata, update your local Elasticsearch with the following commands in Kibana.

# Gets all snapshots in the repository
GET /_snapshot/cofacts/_all?verbose=false

Find the latest snapshot name (like 2020-07-05 below), then run the following command to restore the snapshot to your local Elasticsearch indices.

# You may need to remove all your local Elasticsearch indices before restore
DELETE /_all

# 2020-07-05 is the snapshot name.
#
POST /_snapshot/cofacts/2020-07-05/_restore
{
  "indices": "*,-urls*"
}
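
Picking the latest snapshot from the listing can also be scripted. The sketch below assumes that snapshot names are ISO dates such as 2020-07-05, so the lexicographically largest name is the most recent one, and it only relies on the "snapshots" list and the "snapshot" name field of the listing response. The sample response text is hypothetical and no request is sent.

```python
import json

# Hypothetical, trimmed response of GET /_snapshot/cofacts/_all?verbose=false
SAMPLE_RESPONSE = """
{
  "snapshots": [
    {"snapshot": "2020-06-28"},
    {"snapshot": "2020-07-05"},
    {"snapshot": "2020-06-21"}
  ]
}
"""


def latest_snapshot_name(response_text):
    """Return the largest snapshot name; assumes names are ISO dates."""
    snapshots = json.loads(response_text)["snapshots"]
    return max(item["snapshot"] for item in snapshots)


def restore_path(snapshot_name, repository="cofacts"):
    """Build the request path used for the restore call."""
    return f"/_snapshot/{repository}/{snapshot_name}/_restore"


latest = latest_snapshot_name(SAMPLE_RESPONSE)
print(restore_path(latest))
```

If the snapshot names follow a different scheme, compare on another field instead of the name.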

To check the current recovery progress, run this:

GET /_recovery?human&filter_path=*.shards.stage,*.shards.index.size.percent
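
The recovery response maps each index name to a list of shards, and with the filter_path above each shard reports its stage and the recovered size percentage. A minimal sketch for deciding whether the restore has finished is shown below. The index names and values are hypothetical, and the response is given as an already-parsed dictionary.

```python
# Hypothetical, already-parsed response of the recovery request.
recovery = {
    "articles_v1": {
        "shards": [{"stage": "DONE", "index": {"size": {"percent": "100.0%"}}}]
    },
    "replies_v1": {
        "shards": [{"stage": "INDEX", "index": {"size": {"percent": "42.3%"}}}]
    },
}


def pending_indices(recovery_response):
    """List the indices that still have at least one shard not in the DONE stage."""
    return sorted(
        name
        for name, index in recovery_response.items()
        if any(shard["stage"] != "DONE" for shard in index["shards"])
    )


def recovery_finished(recovery_response):
    """True when every reported shard has reached the DONE stage."""
    return not pending_indices(recovery_response)


print(pending_indices(recovery), recovery_finished(recovery))
```

An empty response also counts as finished here, so make sure the restore has actually started before trusting the result.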

After all indices are restored, run npm start in CLI to generate opendata files.
