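2019 GitHub Open Source Contribution Rankings Are Out: Microsoft and Google Lead, Alibaba Makes the Top 12!

This article was written by yanglbme and first published on the WeChat official account 「Doocs開源社區」 (Doocs Open Source Community). Reproduction without permission is prohibited.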

Data source

The raw data comes from: www.gharchive.org

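Method

We take GitHub's 2019 PushEvent records and, by analyzing the email addresses in users' commit records, work out which organization each contributor belongs to.

For the detailed method, see: www.freecodecamp.org/news/the-to…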
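
The attribution idea can be sketched in a few lines of Python. This is only a minimal illustration, not the query used for the ranking: the function names `email_domain` and `count_by_domain` and the sample addresses are made up for this example, and the provider list is a short subset of the one excluded in the SQL later in this article.

```python
import re
from collections import Counter

# Public mail providers say nothing about an employer, so they are skipped.
# Assumption: a short subset of the providers excluded in the article's SQL.
EMAIL_HOSTERS = {"gmail.com", "users.noreply.github.com", "qq.com", "163.com", "outlook.com"}


def email_domain(email):
    """Return everything after the first '@', like REGEXP_EXTRACT(email, r'@(.*)')."""
    match = re.search(r"@(.*)", email or "")
    return match.group(1) if match else None


def count_by_domain(emails):
    """Count addresses per domain, ignoring public mail providers and invalid input."""
    domains = (email_domain(e) for e in emails)
    return Counter(d for d in domains if d and d not in EMAIL_HOSTERS)


# Hypothetical sample data, one address per contributor.
sample = ["a@microsoft.com", "b@microsoft.com", "c@gmail.com", "d@google.com", "no-at-sign"]
print(count_by_domain(sample).most_common())
```

A commit email is only a guess at an employer: people who commit with a personal address are not attributed to any organization, which is why the provider list matters.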

Analysis tools

  • Google BigQuery
  • Data Studio

SQL query

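Google BigQuery's free tier covers only 1 TB of query data processing per month. To make the most of it, we restrict the query to a fixed date range (20190301-20191001), so that the amount of data processed comes close to 1 TB without exceeding it.

Data from this range gives a rough picture of each organization's open source contributions on GitHub for the whole of 2019.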

SELECT *
FROM `githubarchive.month.2019*` a
WHERE _TABLE_SUFFIX BETWEEN '0301' AND '1001'
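
It is worth checking which tables this filter actually matches. The sketch below rests on an assumption not stated in this article: that the monthly tables are named `githubarchive.month.YYYYMM`, so the wildcard `2019*` leaves a two-digit month as `_TABLE_SUFFIX`. String `BETWEEN` compares lexicographically, which plain Python string comparison reproduces for these ASCII values. The helper `between` is hypothetical.

```python
# Assumption: monthly tables are named githubarchive.month.YYYYMM, so the
# wildcard `2019*` yields a two-digit month string as _TABLE_SUFFIX.
month_suffixes = [f"{m:02d}" for m in range(1, 13)]


def between(value, low, high):
    """Inclusive, lexicographic string comparison, like SQL BETWEEN on strings."""
    return low <= value <= high


# Which month suffixes pass the filter used in the query above?
selected = [s for s in month_suffixes if between(s, "0301", "1001")]
print(selected)
```

Under that assumption, "03" sorts before "0301", so the March table falls outside the range while the whole of October falls inside it. Bounds of '03' and '09' would select March through September instead.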

The complete SQL query is as follows:

#standardSQL
WITH
period AS (
  SELECT *
  FROM `githubarchive.month.2019*` a
  WHERE _TABLE_SUFFIX BETWEEN '0301' AND '1001'
),
repo_stars AS (
  SELECT repo.id, COUNT(DISTINCT actor.login) stars, APPROX_TOP_COUNT(repo.name, 1)[OFFSET(0)].value repo_name 
  FROM period
  WHERE type='WatchEvent'
  GROUP BY 1
  HAVING stars>20
), 
pushers_guess_emails_and_top_projects AS (
  SELECT *, REGEXP_EXTRACT(email, r'@(.*)') domain
  FROM (
    SELECT actor.id
      , APPROX_TOP_COUNT(actor.login,1)[OFFSET(0)].value login
      , APPROX_TOP_COUNT(JSON_EXTRACT_SCALAR(payload, '$.commits[0].author.email'),1)[OFFSET(0)].value email
      , COUNT(*) c
      , ARRAY_AGG(DISTINCT TO_JSON_STRING(STRUCT(b.repo_name,stars))) repos
    FROM period a
    JOIN repo_stars b
    ON a.repo.id=b.id
    WHERE type='PushEvent'
    GROUP BY  1
    HAVING c>3
  )
)
SELECT * FROM (
  SELECT domain
    , githubers
    , (SELECT COUNT(DISTINCT repo) FROM UNNEST(repos) repo) repos_contributed_to
    , ARRAY(
        SELECT AS STRUCT JSON_EXTRACT_SCALAR(repo, '$.repo_name') repo_name
        , CAST(JSON_EXTRACT_SCALAR(repo, '$.stars') AS INT64) stars
        , COUNT(*) githubers_from_domain FROM UNNEST(repos) repo 
        GROUP BY 1, 2 
        HAVING githubers_from_domain>1 
        ORDER BY stars DESC LIMIT 3
      ) top
    , (SELECT SUM(CAST(JSON_EXTRACT_SCALAR(repo, '$.stars') AS INT64)) FROM (SELECT DISTINCT repo FROM UNNEST(repos) repo)) sum_stars_projects_contributed_to
  FROM (
    SELECT domain, COUNT(*) githubers, ARRAY_CONCAT_AGG(ARRAY(SELECT * FROM UNNEST(repos) repo)) repos
    FROM pushers_guess_emails_and_top_projects
    #WHERE domain IN UNNEST(SPLIT('google.com|microsoft.com|amazon.com', '|'))
    WHERE domain NOT IN UNNEST(SPLIT('gmail.com|users.noreply.github.com|qq.com|hotmail.com|163.com|me.com|googlemail.com|outlook.com|yahoo.com|web.de|iki.fi|foxmail.com|yandex.ru', '|')) # email hosters
    GROUP BY 1
    HAVING githubers > 30
  )
  WHERE (SELECT MAX(githubers_from_domain) FROM (SELECT repo, COUNT(*) githubers_from_domain FROM UNNEST(repos) repo  GROUP BY repo))>4 # second filter email hosters
)
ORDER BY githubers DESC
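
The query works in stages: keep repositories starred by more than 20 distinct users in the period, keep accounts with more than 3 pushes to those repositories, guess one email per account, then group accounts by email domain and keep domains with more than 30 accounts. The toy Python sketch below mirrors that staging under loud assumptions: the event tuples, the function `rank_domains` and its parameters are hypothetical, it uses an exact most-common email where the SQL uses `APPROX_TOP_COUNT`, and it leaves out the query's email-hoster filters and the per-domain repository details.

```python
from collections import Counter, defaultdict


def rank_domains(watch_events, push_events, min_stars=20, min_pushes=3, min_githubers=30):
    """watch_events: (repo, login) pairs; push_events: (login, repo, email) triples."""
    # Stage 1 (repo_stars): distinct stargazers per repo, keep repos above the threshold.
    stargazers = defaultdict(set)
    for repo, login in watch_events:
        stargazers[repo].add(login)
    popular = {repo for repo, users in stargazers.items() if len(users) > min_stars}

    # Stage 2: per account, count pushes to popular repos and tally the emails seen.
    pushes = Counter()
    emails = defaultdict(Counter)
    for login, repo, email in push_events:
        if repo in popular:
            pushes[login] += 1
            emails[login][email] += 1

    # Stage 3: one domain per sufficiently active account, then count accounts per domain.
    githubers = Counter()
    for login, n in pushes.items():
        if n > min_pushes:
            top_email = emails[login].most_common(1)[0][0]
            if top_email and "@" in top_email:
                githubers[top_email.split("@", 1)[1]] += 1

    # Stage 4: keep domains with enough accounts, largest first.
    return [(domain, n) for domain, n in githubers.most_common() if n > min_githubers]


# Hypothetical miniature data set, with thresholds lowered so something survives.
watch = [("org/app", "u1"), ("org/app", "u2"), ("solo/toy", "u1")]
push = (
    [("alice", "org/app", "alice@example.com")] * 2
    + [("bob", "org/app", "bob@example.com")] * 2
    + [("carol", "solo/toy", "carol@example.org")] * 5
)
print(rank_domains(watch, push, min_stars=1, min_pushes=1, min_githubers=1))
```

Because only pushes to already-popular repositories count, an account that pushes heavily to an unstarred repository (carol in the sample) does not contribute to its domain.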

As the figure below shows, this query will process 918.4 GB of data.

Results

Click Run; after 17.8 s, the query results appear.

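Comparison of top organizations

From the figure above we can see:

  • Microsoft and Google are far ahead in open source contributions; places 3 to 5 go to redhat, intel and amazon;
  • Microsoft and Google each have more than 1,000 employees (githubers) pushing code to multiple GitHub repositories (repos_contributed_to);
  • For Microsoft, the top 3 repositories of 2019 are Terminal, vscode and TypeScript; for Google, they are flutter, tensorflow and kubernetes.

Places 6 to 10 go to Pivotal, Facebook, Apache, SAP and Shopify.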

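Comparison of major Chinese companies

Among the major Chinese tech companies, Alibaba employees contribute the most to open source, ranking 12th. Their top 3 repositories are flutter-go, nacos and sqlflow, and all their projects together have more than 90,000 stars.

Baidu and Tencent rank 21st and 23rd, respectively.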

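Overview

The top 38 organizations by open source contribution are listed below:

If you have any thoughts, feel free to leave a comment and chat with me. You are also welcome to follow my WeChat official account 「Doocs開源社區」, where original technical articles are published first!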
