PerDoc and PerDocData: Google's Document Blueprints (Google Leak - Module Types)

Last Updated: July 28th, 2024

In the leaked "Content Warehouse" documentation, "PerDoc" and "PerDocData" appear as central components of Google's system for understanding web documents. These data structures hold the information Google gathers and analyzes for each page it crawls, from basic metadata to complex quality scores.

PerDoc: The Foundation, PerDocData: The Deep Dive

Think of PerDoc and PerDocData as comprehensive profiles, containing hundreds of attributes that capture a document's essence:

  • PerDoc: Holds fundamental document properties like URL, language, content type, crawl status, and technical SEO signals (canonicalization, redirects). It serves as the foundation upon which PerDocData builds.
  • PerDocData: Dives deeper into a wide spectrum of ranking signals related to content quality, authority, user engagement, mobile-friendliness, spam detection, freshness, and more.
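To make the two-tier picture concrete, here is a minimal Python sketch of a base profile with a deeper profile layered on top. Every class layout, field name, and value below is an illustrative assumption drawn from the descriptions above. None of them are actual field names or types from the leaked documentation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PerDoc:
    """Hypothetical base profile: the fundamental properties listed above."""
    url: str
    language: str
    content_type: str
    crawl_status: str
    canonical_url: Optional[str] = None  # assumed technical SEO signal


@dataclass
class PerDocData:
    """Hypothetical deeper profile that builds on a PerDoc."""
    doc: PerDoc
    # Signal name -> value. The names are placeholders, not leaked fields.
    signals: Dict[str, float] = field(default_factory=dict)


doc = PerDoc(
    url="https://example.com/post",
    language="en",
    content_type="text/html",
    crawl_status="crawled",
)
data = PerDocData(doc=doc, signals={"spam_score": 0.02, "freshness": 0.8})
print(data.doc.url, sorted(data.signals))
```

The point of the sketch is only the relationship: one record of basic properties per document, and a second record that references it and carries the richer signals.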

Examples of PerDoc and PerDocData Modules (Non-Exhaustive)

  • PerDocData: This module itself is a primary example, containing numerous fields related to spam scores, freshness, language, and authority signals.
  • VideoPerDocData: Contains data specific to video content, such as duration, encoding format, and potential quality or engagement metrics.
  • SmartphonePerDocData and MobilePerDocData: Hold signals related to mobile optimization and user experience, such as mobile-friendliness, interstitial penalties, and page speed metrics.
  • PremiumPerDocData: Contains information relevant to paid or premium content, such as subscription status, pricing, and access restrictions.
  • BlogPerDocData: Represents data specific to blog posts, potentially including author information, comment counts, and social sharing metrics.
  • ImageQualitySensitiveFaceSkinToneSignalsPerDocData: Relates to image analysis, potentially indicating Google's efforts to assess diversity and representation within images.
  • QualityFringeFringeQueryPriorPerDocData: Contains signals related to a document’s potential association with fringe or controversial topics.
  • QualityRichsnippetsAppsProtosLaunchAppInfoPerDocData: Holds information about launching mobile apps related to the document’s content.
  • RepositoryWebrefPerDocRelevanceRating: Holds human ratings or assessments of a document’s relevance to specific topics or entities.
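The module names above share a visible convention: a prefix describing the content type or system, followed by "PerDoc". As a rough sketch, the following Python snippet splits each name on that marker to show what the prefix suggests. The helper `content_prefix` is a hypothetical illustration, and reading meaning into the prefix is an assumption, not something the documentation states.

```python
# Module names exactly as listed in this article.
MODULES = [
    "PerDocData",
    "VideoPerDocData",
    "SmartphonePerDocData",
    "MobilePerDocData",
    "PremiumPerDocData",
    "BlogPerDocData",
    "ImageQualitySensitiveFaceSkinToneSignalsPerDocData",
    "QualityFringeFringeQueryPriorPerDocData",
    "QualityRichsnippetsAppsProtosLaunchAppInfoPerDocData",
    "RepositoryWebrefPerDocRelevanceRating",
]


def content_prefix(name: str) -> str:
    """Return the text before 'PerDoc'; names without it are returned unchanged."""
    head, sep, _ = name.partition("PerDoc")
    return head if sep else name


# Keep only the specialised modules, i.e. those with a non-empty prefix.
specialised = {
    name: content_prefix(name) for name in MODULES if content_prefix(name)
}

for name, prefix in specialised.items():
    print(f"{prefix:55} <- {name}")
```

The generic PerDocData module has an empty prefix and is filtered out, which leaves the specialised variants grouped by the area they appear to cover.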

Unveiling Google's Evaluation Process

The leaked documentation's references to PerDoc and PerDocData provide a framework for reasoning about how Google might evaluate web pages. By analyzing the attributes and modules associated with these data structures, SEOs can identify factors that may influence search rankings. The documentation describes what data is stored, not whether or how heavily each attribute counts in ranking.

Key Takeaways

  • Comprehensive Analysis: Google's assessment of web pages goes well beyond basic content and keywords. The modules listed above cover spam, freshness, mobile experience, video, images, and more.
  • Signal Integration: PerDoc and PerDocData integrate signals from various sources and systems, creating a holistic picture of a page's characteristics and value.
  • Constant Evolution: Google's algorithms are constantly evolving. The specific signals and their weights may change over time, requiring SEOs to stay informed and adapt their strategies accordingly.
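The signal integration idea can be sketched in a few lines of Python: signals arriving from separate systems are merged into a single record per document. The source names ("crawl", "quality", "mobile"), the signal names, and the `integrate` function are all hypothetical. This is a toy illustration of merging per-document data, not a description of Google's actual pipeline.

```python
from collections import defaultdict


def integrate(records):
    """Merge (url, source, signals) tuples into one signal dict per URL."""
    merged = defaultdict(dict)
    for url, source, signals in records:
        for name, value in signals.items():
            # Prefix with the source so signals from different systems never collide.
            merged[url][f"{source}.{name}"] = value
    return dict(merged)


# Hypothetical inputs from three separate systems.
records = [
    ("https://example.com/a", "crawl", {"language": "en"}),
    ("https://example.com/a", "quality", {"spam_score": 0.1}),
    ("https://example.com/b", "mobile", {"mobile_friendly": True}),
]

profiles = integrate(records)
for url, signals in profiles.items():
    print(url, signals)
```

Keeping the source name in each key makes it easy to see which system contributed which signal once everything sits in one per-document profile.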

Conclusion

While the leaked documents don't reveal the precise formula behind Google's ranking algorithms, they offer a valuable glimpse into the types of data Google collects and analyzes. By studying these data structures and the signals they contain, SEOs can better understand how Google evaluates web pages and make more informed optimization decisions.