Disparity in Data

TL;DR: The data you see on vTual may not be real-time data.

Getting Started

vTual is committed to providing information about the activities of content creators across various platforms.

However, we must acknowledge certain limitations to this commitment. While we strive to deliver comprehensive, up-to-date information, several factors affect how quickly data reaches you.

Affected Things

The data affected by this disparity includes the feeds for YouTube and Twitch.

YouTube Feeds

At vTual, content uploaded or created and made publicly available on the YouTube platform is referred to as the "YouTube Feed." There are two types of data processing within the YouTube Feed: Initial and Checker.

Initial

The Initial function serves as the entry point for all content from your YouTube channel, importing your channel’s entire collection of content into your vTual profile.

The data handled by the Initial function is raw data from YouTube and lacks sufficient detail, so by default it is not displayed on vTual.

The Initial function runs every 10 minutes, which is the first reason for disparity in YouTube-related data: content published during that window is only retrieved at the next run, up to 10 minutes later.

Checker

Raw data taken directly from YouTube is then checked and processed by the Checker function, which adds details such as view counts, published dates, and scheduled dates.

The Checker function runs every 3 minutes, which is the second reason for disparity in YouTube-related data: the viewer count may change during that window, but we only retrieve the new value at the next run, up to 3 minutes later.
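To make the two intervals concrete, here is a minimal sketch of the worst-case delay they imply. It assumes the Initial and Checker functions run on independent fixed schedules and that processing time is negligible; neither assumption is confirmed by vTual, and the function name is hypothetical.

```python
# Intervals taken from the text above.
INITIAL_INTERVAL_MIN = 10  # Initial runs every 10 minutes
CHECKER_INTERVAL_MIN = 3   # Checker runs every 3 minutes


def worst_case_delay(*intervals_min: int) -> int:
    """Upper bound, in minutes, on how stale data can be.

    Assumption: each stage polls on its own fixed interval, a record has
    to pass through every listed stage in order, and processing time is
    ignored. In the worst case a record just misses every run.
    """
    return sum(intervals_min)


# A new video must be picked up by Initial, then enriched by Checker
# before it is displayed.
new_video_delay = worst_case_delay(INITIAL_INTERVAL_MIN, CHECKER_INTERVAL_MIN)

# A view count on an already known video only waits for the next Checker run.
view_count_delay = worst_case_delay(CHECKER_INTERVAL_MIN)

print(new_video_delay)   # 13
print(view_count_delay)  # 3
```

Under these assumptions, a newly published video could take up to 13 minutes to appear, while a view count could lag by up to 3 minutes.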

Twitch Feeds

At vTual, content streamed and made publicly available on the Twitch platform is referred to as the "Twitch Feed." Currently there is only one type of data processing within the Twitch Feed: Checker.

Checker

Raw data taken directly from Twitch is then checked and processed by the Checker function, which adds details such as view counts, published dates, and scheduled dates.

The Checker function runs every 3 minutes, which is the first reason for disparity in Twitch-related data: the viewer count may change during that window, but we only retrieve the new value at the next run, up to 3 minutes later.

Challenges in the System

vTual handles and manages a large volume of data, which presents a unique challenge in both development and maintenance.

We employ various strategies to make this process more efficient and streamlined while keeping the implementation cost-effective. However, some of these strategies also contribute to the disparity in data.

Chunk

In a single data processing cycle, vTual typically needs to handle hundreds to thousands of data entries within the same period, which can put a significant load on the server.

To address this, vTual divides the workload into smaller parts, known as chunks. Instead of processing 1,000 data entries all at once, we process 100 entries at a time over ten cycles. This is a simplified picture, but it captures the idea. This approach helps to distribute the server load more evenly and ensures smoother operations.

Chunking reduces the overall strain on the server. However, the data may take longer to become fully available, which is another reason for disparity in data.
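The chunking idea can be sketched as follows. This is only an illustration of the 1,000-entries-in-chunks-of-100 example from the text; the `chunked` helper and the placeholder entries are hypothetical and are not vTual's actual implementation.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(entries: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `entries`, each at most `size` long."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(entries), size):
        yield entries[start:start + size]


# 1,000 placeholder entries processed 100 at a time.
entries = list(range(1000))
chunks = list(chunked(entries, 100))

for cycle, chunk in enumerate(chunks, start=1):
    # A real system would process the chunk here. The last chunk is only
    # handled in the final cycle, which is where the extra delay comes from.
    pass

print(len(chunks))     # 10
print(len(chunks[0]))  # 100
```

Entries in the last chunk are handled ten cycles after the work starts, so their data becomes available later than it would if everything were processed at once.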

Database Pool

As noted above, vTual handles and manages a large volume of data. The heavier the load, the stronger the supporting infrastructure must be.

In this case, vTual opts for a horizontal scaling approach rather than vertical scaling. In other words, vTual prefers to have 100 people each lifting 1 kilogram rather than a single person lifting 100 kilograms. This strategy ensures a more balanced and resilient system.

In simple terms, vTual prefers to store and manage data across several smaller servers rather than relying on a single large, centralized server. We believe this approach is safer, more efficient, and cost-effective compared to depending on one large server.

However, this approach requires a pooler to collect and synchronize data from each server involved in the data processing. The pooler needs to ensure that all information is accurately aggregated and up to date across the distributed system. This synchronization takes time, which is another reason for disparity in data.
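As a rough illustration of what a pooler has to do, the sketch below merges records from several servers and keeps the most recently updated copy of each item. The `Record` type, its fields, the `pool` function, and the "latest timestamp wins" rule are all assumptions made for this example; the source does not describe how vTual's pooler actually resolves differences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    content_id: str
    views: int
    updated_at: int  # Unix timestamp in seconds (hypothetical field)


def pool(servers: list[list[Record]]) -> dict[str, Record]:
    """Merge records from every server, keeping the newest copy per item.

    Assumption: the copy with the larger `updated_at` is the correct one.
    """
    merged: dict[str, Record] = {}
    for records in servers:
        for record in records:
            current = merged.get(record.content_id)
            if current is None or record.updated_at > current.updated_at:
                merged[record.content_id] = record
    return merged


# Two hypothetical servers holding different versions of "video-1".
server_a = [Record("video-1", 120, 1000), Record("video-2", 40, 1005)]
server_b = [Record("video-1", 150, 1060)]

pooled = pool([server_a, server_b])
print(pooled["video-1"].views)  # 150
print(pooled["video-2"].views)  # 40
```

Until the pooler has visited every server, a reader may still see the older value (120 views here), which is the kind of temporary disparity described above.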

Verdict

It turns out that developing and maintaining a project like vTual is not as simple as it seems. The costs involved have also significantly exceeded our initial expectations. Nevertheless, we remain committed to delivering the best possible experience for all our users despite these challenges.