Introduction
- GitHub: https://github.com/GoatWang/ithome_ironman
- Reader Feedback:
Learning Note
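When fetching web pages, every request for an HTML file has to wait a while for the server to respond. During that wait the client (that is, your computer) is not actually computing anything, so if that time can be used to send other requests, crawling speed can increase considerably. The concept behind asynchronous techniques is fairly complex, though, and even touches on some hardware knowledge. The main difficulty lies in explaining how it differs from multithreading, which I will not go into here; if you are interested, you can Google it.
An overview of some packages I have worked with, to give readers a basic understanding of the crawler "technology chain" and of "common problems and their solutions".
This article is written mainly for beginners who are just starting to learn web crawling with Python. When I first started learning this material, package names came flying at me like snowflakes. Sometimes I misunderstood how a package was meant to be used, and sometimes I expected too much of a package, only to find it unremarkable once I had learned it, which left me with a sense of disappointment. That is why I wrote this.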
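The idea of using the waiting time to send other requests can be sketched with Python's `asyncio`. This is only a minimal illustration, not a real crawler: `fake_fetch` is a hypothetical stand-in that simulates the server's response time with `asyncio.sleep` instead of making an HTTP request, and the URLs and the 0.2 second delay are made up.

```python
import asyncio
import time


async def fake_fetch(url, delay):
    # Hypothetical stand-in for an HTTP request: asyncio.sleep simulates
    # the time spent waiting for the server, during which the client is idle.
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def crawl(urls, delay):
    # Start every request at once; the waits overlap instead of adding up.
    return await asyncio.gather(*(fake_fetch(url, delay) for url in urls))


def timed_crawl(urls, delay=0.2):
    # Run the crawl and report how long the whole batch took.
    start = time.perf_counter()
    pages = asyncio.run(crawl(urls, delay))
    return pages, time.perf_counter() - start


if __name__ == "__main__":
    pages, elapsed = timed_crawl(["page1", "page2", "page3"])
    print(len(pages), "pages in", round(elapsed, 2), "seconds")
```

Fetching the three pages one after another would take roughly three times the delay, while the concurrent version takes roughly one delay. In a real crawler the simulated wait would be replaced by an asynchronous HTTP client.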
Because Facebook changed the query string of its API, the built-in system for connecting to Facebook produces errors. With Google authentication, you only have to enter the ClientId and ClientSecret and set up the app on the Google side, and you can then connect to the Google API successfully. With Facebook authentication, however, you have to redefine the query string.
At first, I tried to use mLab’s built-in backup system. However, it is not included in the free 500 MB plan. As a result, I wrote a C# program to do the backups myself.
There are some keywords about the model for you to consider: unsupervised learning, LSTM, and encoding (translating) a word into a vector. Actually, I don’t know the theory behind word2vec in detail. What I can tell you is that, using a model trained on a corpus with this method, you can find the words related to a specific word or to a list of words. In addition, the best-known property of this approach is that the vectors the model produces for related words can be used in arithmetic. For example, (the vector of) king - man is approximately equal to queen - woman. That is, you can calculate Taiwan’s vector as France - Paris + Taipei.
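The vector arithmetic can be illustrated with a toy example. The two-dimensional vectors below are hand-made for illustration and are not produced by word2vec; the `analogy` helper is hypothetical. It adds and subtracts the given vectors, then returns the remaining word whose vector has the highest cosine similarity to the result.

```python
import math

# Hand-made toy vectors, NOT trained by word2vec.
# Axis 0 loosely stands for "royalty", axis 1 for "gender".
toy_vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}


def cosine(a, b):
    # Cosine similarity between two non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(positive, negative, vectors):
    # Build the target vector: sum of positive words minus sum of negative words.
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Exclude the query words themselves, then pick the most similar word.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


if __name__ == "__main__":
    print(analogy(["king", "woman"], ["man"], toy_vectors))
```

With a model actually trained on a corpus, the equality only holds approximately, which is why the nearest vector is searched for rather than an exact match. Libraries such as gensim expose a comparable query through `most_similar(positive=..., negative=...)`.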
From scikit-learn:
This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of the true labels given a probabilistic classifier’s predictions. The log loss is only defined for two or more labels. For a single sample with true label yt in {0,1} and estimated probability yp that yt = 1, the log loss is
$$-\log P(yt \mid yp) = -\left(yt \log(yp) + (1 - yt) \log(1 - yp)\right)$$
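The formula can be checked by hand with a few lines of Python. This is a minimal sketch that follows the single-sample definition quoted above; the function names are made up, and clipping the probability by a small `eps` is an assumption added here so that `math.log` never receives 0.

```python
import math


def log_loss_single(yt, yp, eps=1e-15):
    # yt: true label (0 or 1); yp: estimated probability that yt = 1.
    # Assumption: clip yp away from 0 and 1 to avoid log(0).
    yp = min(max(yp, eps), 1 - eps)
    return -(yt * math.log(yp) + (1 - yt) * math.log(1 - yp))


def log_loss_mean(y_true, y_prob):
    # Average the single-sample loss over a list of samples.
    losses = [log_loss_single(t, p) for t, p in zip(y_true, y_prob)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # A confident correct prediction gives a small loss,
    # a confident wrong prediction gives a large one.
    print(log_loss_single(1, 0.9))
    print(log_loss_single(1, 0.1))
```

For real work, scikit-learn provides this metric as `sklearn.metrics.log_loss`.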
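Indri is a powerful IR tool. For more information, you can go to its home page. One thing worth mentioning in this article: although its developers state that it can be set up on Windows, doing so is quite hard work. As a result, I want to write down the process of setting it up on Windows.
When evaluating information retrieval, there are many aspects to the metrics people care about. In the past, the more important ones were things like the number of results and the search speed, but as technology has advanced, the focus has shifted toward precision in its various aspects, which is also the focus of this chapter. It is worth mentioning, though, that user interface (UI) and user experience (UX) are also topics that people in this field continue to follow and study. For example, the top-10 result method used by Google returns only the top ten relevant results per page, and Searchme Visual Search offers an innovative presentation in which search results can be previewed. However, this kind of information is very hard to study quantitatively, so this chapter will focus mainly on how precision is presented.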