Wikipedia Dataset Generator

You now have total control over the balance between article quality and the size of your own Wikipedia dataset. This means that you can generate a small dataset of very high-quality articles, a large dataset of mixed-quality articles, or choose your own balance.

With the first script, download-pageviews.py, you can automatically download pageview files from the Wikimedia servers. But what are pageviews, you may ask. They form a database of over 8 years of Wikipedia page view statistics, precise down to the hour. From this set of files, you can estimate the popularity and quality of the articles you want to include in your own dataset. If you want your language model to know about the latest events, select a recent range of start (-s) and end (-e) dates. Need something more flexible and reliable? You can use the full 8 years of statistics.

The second script, process-pageviews.py, filters out unwanted articles based on specific criteria, such as their title or, most importantly, their popularity. Popularity is calculated by averaging over all the pageview files you downloaded, which makes the estimate very reliable.

The third script, generate-dataset.py, reads the filtered articles and cleans them up to create a JSON file with a structured format. The script removes unwanted text, formats and converts units, and removes odd characters and symbols. The resulting JSON file contains a list of articles with their titles and contents.

Premade datasets

I uploaded a dataset built with this project to Hugging Face: ThomasBaruzier/wikipedia.

Get started

The English Wikipedia data dump: [Torrent file] [List of other torrents]

To install the required packages, run:

pip install -r requirements.txt

Downloading Pageviews

To download Wikipedia pageview data, run download-pageviews.py:

python download-pageviews.py [-c COUNT] [-s START_DATE] [-e END_DATE] [-n]
  • -c COUNT: The number of files to download (default is 10).
  • -s START_DATE: The start date in YYYYMMDD format (default is 1 year ago).
  • -e END_DATE: The end date in YYYYMMDD format (default is the latest available pageview file).
  • -n: Exports the URLs to a file instead of downloading them.
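
To give an idea of what the `-n` option exports, here is a minimal sketch that builds hourly pageview URLs for a date range. It is not the code of download-pageviews.py: the function name `pageview_urls` is hypothetical, and spreading the COUNT files evenly between the start and end dates is an assumption. The URL layout follows the public Wikimedia pageviews dump.

```python
from datetime import datetime, timedelta

# Public location of the hourly pageview dumps.
BASE_URL = "https://dumps.wikimedia.org/other/pageviews"


def pageview_urls(start: str, end: str, count: int = 10) -> list[str]:
    """Return `count` hourly pageview URLs spread evenly between two YYYYMMDD dates.

    Even spacing is an assumption made for this sketch; the real script may
    select files differently.
    """
    first = datetime.strptime(start, "%Y%m%d")
    # Include every hour of the end date, up to 23:00.
    last = datetime.strptime(end, "%Y%m%d") + timedelta(hours=23)
    total_hours = int((last - first).total_seconds() // 3600)
    if total_hours < 0:
        raise ValueError("end date must not precede start date")

    # Never ask for more files than there are hours in the range.
    count = min(count, total_hours + 1)
    step = total_hours / max(count - 1, 1)

    urls = []
    for i in range(count):
        ts = first + timedelta(hours=round(i * step))
        urls.append(
            f"{BASE_URL}/{ts:%Y}/{ts:%Y-%m}/pageviews-{ts:%Y%m%d-%H}0000.gz"
        )
    return urls


if __name__ == "__main__":
    for url in pageview_urls("20230101", "20230107", count=5):
        print(url)
```

Sampling across the whole range rather than taking consecutive hours keeps a small COUNT from being dominated by a single day's traffic.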

Processing Pageviews

To process the downloaded pageview data and generate a list of article titles, run process-pageviews.py:

python process-pageviews.py [THRESHOLD]
  • [THRESHOLD]: The minimum number of pageviews for an article to be included in the output (default is 50).
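
The averaging and threshold filtering can be sketched as follows. This is an illustration under stated assumptions, not the actual logic of process-pageviews.py: the helper names `average_views` and `popular_titles` are hypothetical, each line is assumed to follow the Wikimedia hourly format `project title views bytes`, and the threshold is assumed to apply to the per-file average.

```python
from collections import defaultdict


def average_views(files: list[list[str]], project: str = "en") -> dict[str, float]:
    """Average each title's views over all files.

    A title missing from a file contributes 0 views for that file.
    """
    totals: dict[str, int] = defaultdict(int)
    for lines in files:
        for line in lines:
            parts = line.split()
            # Skip other projects and malformed lines.
            if len(parts) != 4 or parts[0] != project:
                continue
            title, views = parts[1], parts[2]
            if views.isdigit():
                totals[title] += int(views)
    return {title: total / len(files) for title, total in totals.items()}


def popular_titles(averages: dict[str, float], threshold: float = 50) -> list[str]:
    """Keep titles whose average views reach the threshold (assumed semantics)."""
    return sorted(t for t, avg in averages.items() if avg >= threshold)


if __name__ == "__main__":
    hour_a = ["en Asteroid 120 0", "en Obscure_page 3 0"]
    hour_b = ["en Asteroid 80 0", "en Obscure_page 1 0"]
    print(popular_titles(average_views([hour_a, hour_b])))
```

Averaging over many files smooths out one-off traffic spikes, which is why downloading more pageview files gives a steadier popularity estimate.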

Generating the Dataset

To generate the final dataset, run generate-dataset.py:

python generate-dataset.py

The script reads the list of article titles generated by process-pageviews.py and processes the corresponding Wikipedia articles to generate the final dataset. The output file is dataset.json. Expect about 2 hours of generation time. You can press CTRL+C at any time and the JSON file will still be valid, but there's a catch: you can't resume your progress (PRs welcome).
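
Once dataset.json exists, it can be read back with the standard library. The sketch below assumes the file is a list of objects with `title` and `content` keys, as in the example further down, and that titles look like `Article, Section`. Both helper names are hypothetical, and splitting on the first `", "` is only a heuristic: an article whose own name contains a comma would be split in the wrong place.

```python
import json


def load_dataset(path: str = "dataset.json") -> list[dict]:
    """Load the generated dataset: a list of {"title": ..., "content": ...} objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def group_sections(entries: list[dict]) -> dict[str, list[tuple[str, str]]]:
    """Group entries by article name, keeping (section, content) pairs in order."""
    articles: dict[str, list[tuple[str, str]]] = {}
    for entry in entries:
        # Heuristic: everything before the first ", " is the article name.
        article, _, section = entry["title"].partition(", ")
        articles.setdefault(article, []).append((section, entry["content"]))
    return articles


if __name__ == "__main__":
    grouped = group_sections(load_dataset())
    for article, sections in grouped.items():
        print(f"{article}: {len(sections)} sections")
```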

Limitations

  • Tested on the English Wikipedia dump only.
  • Removes equations and other complicated content such as foreign-language translations.
  • I fought hard to keep the error rate low, but it still fails sometimes. The most common error comes from the filtering of complex structures, which leaves nonsense sentences here and there.
  • You need to download the dump yourself using a torrent client.

Example of generated data

[
  {
    "title": "Asteroid, Definition",
    "content": "An asteroid is a minor planet of the inner Solar System. Sizes and shapes of asteroids vary significantly, ranging from 1-meter rocks to a dwarf planet almost 1000 km in diameter; they are rocky, metallic or icy bodies with no atmosphere.\nAsteroids have been historically observed from Earth; the \"Galileo\" spacecraft provided the first close observation of an asteroid. Several dedicated missions to asteroids were subsequently launched by NASA and JAXA, with plans for other missions in progress. NASA's \"NEAR Shoemaker\" studied Eros, and \"Dawn\" observed Vesta and Ceres. JAXA's missions \"Hayabusa\" and \"Hayabusa2\" studied and returned samples of Itokawa and Ryugu, respectively. OSIRIS-REx studied Bennu, collecting a sample in 2020 to be delivered back to Earth in 2023. NASA's \"Lucy\", launched in 2021, will study ten different asteroids, two from the main belt and eight Jupiter trojans. \"Psyche\", scheduled for launch in 2023, will study a metallic asteroid of the same name.\nNear-Earth asteroids can threaten all life on the planet; an asteroid impact event resulted in the Cretaceous-Paleogene extinction. Different asteroid deflection strategies have been proposed; the Double Asteroid Redirection Test spacecraft, or DART, was launched in 2021 and intentionally impacted Dimorphos in September 2022, successfully altering its orbit by crashing into it."
  },
  {
    "title": "Asteroid, History of observations",
    "content": "Only one asteroid, 4 Vesta, which has a relatively reflective surface, is normally visible to the naked eye. When favorably positioned, 4 Vesta can be seen in dark skies. Rarely, small asteroids passing close to Earth may be visible to the naked eye for a short time., the Minor Planet Center had data on 1,199,224 minor planets in the inner and outer Solar System, of which about 614,690 had enough information to be given numbered designations."
  },
  {
    "title": "Asteroid, Discovery of Ceres",
    "content": "Bode's formula predicted another planet would be found with an orbital radius near 2.8 astronomical units, or 420 million km, from the Sun. The Titius-Bode law got a boost with William Herschel's discovery of Uranus near the predicted distance for a planet beyond Saturn. In 1800, a group headed by Franz Xaver von Zach, editor of the German astronomical journal \"Monatliche Correspondenz\", sent requests to 24 experienced astronomers, asking that they combine their efforts and begin a methodical search for the expected planet. Although they did not discover Ceres, they later found the asteroids 2 Pallas, 3 Juno and 4 Vesta.\nThe light was a little faint, and of the colour of Jupiter, but similar to many others which generally are reckoned of the eighth magnitude. Therefore I had no doubt of its being any other than a fixed star... The evening of the third, my suspicion was converted into certainty, being assured it was not a fixed star. Nevertheless before I made it known, I waited till the evening of the fourth, when I had the satisfaction to see it had moved at the same rate as on the preceding days.\nPiazzi observed Ceres a total of 24 times, the final time on 11 February 1801, when illness interrupted his work. He announced his discovery on 24 January 1801 in letters to only two fellow astronomers, his compatriot Barnaba Oriani of Milan and Bode in Berlin. He reported it as a comet but \"since its movement is so slow and rather uniform, it has occurred to me several times that it might be something better than a comet\". In April, Piazzi sent his complete observations to Oriani, Bode, and French astronomer Jérôme Lalande. The information was published in the September 1801 issue of the \"Monatliche Correspondenz\".\nBy this time, the apparent position of Ceres had changed, and was too close to the Sun's glare for other astronomers to confirm Piazzi's observations. 
Toward the end of the year, Ceres should have been visible again, but after such a long time it was difficult to predict its exact position. To recover Ceres, mathematician Carl Friedrich Gauss, then 24 years old, developed an efficient method of orbit determination. Piazzi named the newly discovered object \"Ceres Ferdinandea\" \"in honor of the patron goddess of Sicily and of King Ferdinand of Bourbon\"."
  },
  {
    "title": "Asteroid, Further search",
    "content": "Three other asteroids were discovered by von Zach's group over the next few years, with Vesta found in 1807.\nIn 1891, Max Wolf pioneered the use of astrophotography to detect asteroids, which appeared as short streaks on long-exposure photographic plates. whereas only slightly more than 300 had been discovered up to that point. It was known that there were many more, but most astronomers did not bother with them, some calling them \"vermin of the skies\", a phrase variously attributed to Eduard Suess and Edmund Weiss. Even a century later, only a few thousand asteroids were identified, numbered and named."
  },
  {
    "title": "Asteroid, 19th and 20th centuries",
    "content": "In the past, asteroids were discovered by a four-step process. First, a region of the sky was photographed by a wide-field telescope, or astrograph. Pairs of photographs were taken, typically one hour apart. Multiple pairs could be taken over a series of days. Second, the two films or plates of the same region were viewed under a stereoscope. A body in orbit around the Sun would move slightly between the pair of films. Under the stereoscope, the image of the body would seem to float slightly above the background of stars. Third, once a moving body was identified, its location would be measured precisely using a digitizing microscope. The location would be measured relative to known star locations.\nThese first three steps do not constitute asteroid discovery: the observer has only found an apparition, which gets a provisional designation, made up of the year of discovery, a letter representing the half-month of discovery, and finally a letter and a number indicating the discovery's sequential number. The last step is sending the locations and time of observations to the Minor Planet Center, where computer programs determine whether an apparition ties together earlier apparitions into a single orbit. If so, the object receives a catalogue number and the observer of the first apparition with a calculated orbit is declared the discoverer, and granted the honor of naming the object subject to the approval of the International Astronomical Union."
  },
  {
    "title": "Asteroid, Naming",
    "content": "By 1851, the Royal Astronomical Society decided that asteroids were being discovered at such a rapid rate that a different system was needed to categorize or name asteroids. In 1852, when de Gasparis discovered the twentieth asteroid, Benjamin Valz gave it a name and a number designating its rank among asteroid discoveries, 20 Massalia. Sometimes asteroids were discovered and not seen again. So, starting in 1892, new asteroids were listed by the year and a capital letter indicating the order in which the asteroid's orbit was calculated and registered within that specific year. For example, the first two asteroids discovered in 1892 were labeled 1892A and 1892B. However, there were not enough letters in the alphabet for all of the asteroids discovered in 1893, so 1893Z was followed by 1893AA. A number of variations of these methods were tried, including designations that included year plus a Greek letter in 1914. A simple chronological numbering system was established in 1925.\nCurrently all newly discovered asteroids receive a provisional designation consisting of the year of discovery and an alphanumeric code indicating the half-month of discovery and the sequence within that half-month. Once an asteroid's orbit has been confirmed, it is given a number, and later may also be given a name. The formal naming convention uses parentheses around the number - e.g. Eros - but dropping the parentheses is quite common. Informally, it is also common to drop the number altogether, or to drop it after the first mention when a name is repeated in running text. In addition, names can be proposed by the asteroid's discoverer, within guidelines established by the International Astronomical Union."
  },
  {
    "title": "Asteroid, Symbols",
    "content": "The first asteroids to be discovered were assigned iconic symbols like the ones traditionally used to designate the planets. By 1855 there were two dozen asteroid symbols, which often occurred in multiple variants.\nIn 1851, after the fifteenth asteroid, Eunomia, had been discovered, Johann Franz Encke made a major change in the upcoming 1854 edition of the \"Berliner Astronomisches Jahrbuch\". He introduced a disk, a traditional symbol for a star, as the generic symbol for an asteroid. The circle was then numbered in order of discovery to indicate a specific asteroid. The numbered-circle convention was quickly adopted by astronomers, and the next asteroid to be discovered was the first to be designated in that way at the time of its discovery. However, Psyche was given an iconic symbol as well, as were a few other asteroids discovered over the next few years. 20 Massalia was the first asteroid that was not assigned an iconic symbol, and no iconic symbols were created after the 1855 discovery of 37 Fides."
  },
  {
    "title": "Asteroid, Terminology",
    "content": "Traditionally, small bodies orbiting the Sun were classified as comets, asteroids, or meteoroids, with anything smaller than one meter across being called a meteoroid. The term \"asteroid\" never had a formal definition, with the broader term \"small Solar System bodies\" being preferred by the International Astronomical Union. As no IAU definition exists, \"asteroid\" can be defined as \"an irregularly shaped rocky body orbiting the Sun that does not qualify as a planet or a dwarf planet under the IAU definitions of those terms\".\nWhen found, asteroids were seen as a class of objects distinct from comets, and there was no unified term for the two until \"small Solar System body\" was coined in 2006. The main difference between an asteroid and a comet is that a comet shows a coma due to sublimation of near-surface ices by solar radiation. A few objects have ended up being dual-listed because they were first classified as minor planets but later showed evidence of cometary activity. Conversely, some comets are eventually depleted of their surface volatile ices and become asteroid-like. A further distinction is that comets typically have more eccentric orbits than most asteroids; \"asteroids\" with notably eccentric orbits are probably dormant or extinct comets.\nFor almost two centuries, from the discovery of Ceres in 1801 until the discovery of the first centaur, 2060 Chiron in 1977, all known asteroids spent most of their time at or within the orbit of Jupiter, though a few such as 944 Hidalgo ventured far beyond Jupiter for part of their orbit. When astronomers started finding more small bodies that permanently resided further out than Jupiter, now called centaurs, they numbered them among the traditional asteroids. There was debate over whether these objects should be considered asteroids or given a new classification. 
Then, when the first trans-Neptunian object, 15760 Albion, was discovered in 1992, and especially when large numbers of similar objects started turning up, new terms were invented to sidestep the issue: Kuiper-belt object, trans-Neptunian object, scattered-disc object, and so on. They inhabit the cold outer reaches of the Solar System where ices remain solid and comet-like bodies are not expected to exhibit much cometary activity; if centaurs or trans-Neptunian objects were to venture close to the Sun, their volatile ices would sublimate, and traditional approaches would classify them as comets and not asteroids.\nThe innermost of these are the Kuiper-belt objects, called \"objects\" partly to avoid the need to classify them as asteroids or comets. They are thought to be predominantly comet-like in composition, though some may be more akin to asteroids. Furthermore, most do not have the highly eccentric orbits associated with comets, and the ones so far discovered are larger than traditional comet nuclei. Other recent observations, such as the analysis of the cometary dust collected by the \"Stardust\" probe, are increasingly blurring the distinction between comets and asteroids, suggesting \"a continuum between asteroids and comets\" rather than a sharp dividing line.\nThe minor planets beyond Jupiter's orbit are sometimes also called \"asteroids\", especially in popular presentations. However, it is becoming increasingly common for the term \"asteroid\" to be restricted to minor planets of the inner Solar System. Therefore, this article will restrict itself for the most part to the classical asteroids: objects of the asteroid belt, Jupiter trojans, and near-Earth objects.\nWhen the IAU introduced the class small Solar System bodies in 2006 to include most objects previously classified as minor planets and comets, they created the class of dwarf planets for the largest minor planets - those that have enough mass to have become ellipsoidal under their own gravity. 
According to the IAU, \"the term 'minor planet' may still be used, but generally, the term 'Small Solar System Body' will be preferred\". Currently only the largest object in the asteroid belt, Ceres, at about 975 km across, has been placed in the dwarf planet category."
  },
  {
    "title": "Asteroid, Formation",
    "content": "Many asteroids are the shattered remnants of planetesimals, bodies within the young Sun's solar nebula that never grew large enough to become planets. Ceres and Vesta grew large enough to melt and differentiate, with heavy metallic elements sinking to the core, leaving rocky minerals in the crust.\nIn the Nice model, many Kuiper-belt objects are captured in the outer asteroid belt, at distances greater than 2.6 AU. Most were later ejected by Jupiter, but those that remained may be the D-type asteroids, and possibly include Ceres."
  },
  {
    "title": "Asteroid, Asteroid belt",
    "content": "Contrary to popular imagery, the asteroid belt is mostly empty. The asteroids are spread over such a large volume that reaching an asteroid without aiming carefully would be improbable. Nonetheless, hundreds of thousands of asteroids are currently known, and the total number ranges in the millions or more, depending on the lower size cutoff. Over 200 asteroids are known to be larger than 100 km, and a survey in the infrared wavelengths has shown that the asteroid belt has between 700,000 and 1.7 million asteroids with a diameter of 1 km or more. The absolute magnitudes of most of the known asteroids are between 11 and 19, with the median at about 16.\nThe total mass of the asteroid belt is estimated to be kg, which is just 3% of the mass of the Moon; the mass of the Kuiper Belt and Scattered Disk is over 100 times as large. The four largest objects, Ceres, Vesta, Pallas, and Hygiea, account for maybe 62% of the belt's total mass, with 39% accounted for by Ceres alone."
  },
  {
    "title": "Asteroid, Trojans",
    "content": "Trojans are populations that share an orbit with a larger planet or moon, but do not collide with it because they orbit in one of the two Lagrangian points of stability, and, which lie 60° ahead of and behind the larger body.\nIn the Solar System, most known trojans share the orbit of Jupiter. They are divided into the Greek camp at and the Trojan camp at. More than a million Jupiter trojans larger than one kilometer are thought to exist, of which more than 7,000 are currently catalogued. In other planetary orbits only nine Mars trojans, 28 Neptune trojans, two Uranus trojans, and two Earth trojans, have been found to date. A temporary Venus trojan is also known. Numerical orbital dynamics stability simulations indicate that Saturn and Uranus probably do not have any primordial trojans."
  },
  {
    "title": "Asteroid, Near-Earth asteroids",
    "content": "Near-Earth asteroids, or NEAs, are asteroids that have orbits that pass close to that of Earth. Asteroids that actually cross Earth's orbital path are known as \"Earth-crossers\"., a total of 28,772 near-Earth asteroids were known; 878 have a diameter of one kilometer or larger.\nA small number of NEAs are extinct comets that have lost their volatile surface materials, although having a faint or intermittent comet-like tail does not necessarily result in a classification as a near-Earth comet, making the boundaries somewhat fuzzy. The rest of the near-Earth asteroids are driven out of the asteroid belt by gravitational interactions with Jupiter."
  },
  {
    "title": "Asteroid, Martian moons",
    "content": "It is unclear whether Martian moons Phobos and Deimos are captured asteroids or were formed due to impact event on Mars. Phobos and Deimos both have much in common with carbonaceous C-type asteroids, with spectra, albedo, and density very similar to those of C- or D-type asteroids. Based on their similarity, one hypothesis is that both moons may be captured main-belt asteroids. Both moons have very circular orbits which lie almost exactly in Mars's equatorial plane, and hence a capture origin requires a mechanism for circularizing the initially highly eccentric orbit, and adjusting its inclination into the equatorial plane, most probably by a combination of atmospheric drag and tidal forces, although it is not clear whether sufficient time was available for this to occur for Deimos.\nPhobos could be a second-generation Solar System object that coalesced in orbit after Mars formed, rather than forming concurrently out of the same birth cloud as Mars.\nAnother hypothesis is that Mars was once surrounded by many Phobos- and Deimos-sized bodies, perhaps ejected into orbit around it by a collision with a large planetesimal. The high porosity of the interior of Phobos is inconsistent with an asteroidal origin. Observations of Phobos in the thermal infrared suggest a composition containing mainly phyllosilicates, which are well known from the surface of Mars. The spectra are distinct from those of all classes of chondrite meteorites, again pointing away from an asteroidal origin. Both sets of findings support an origin of Phobos from material ejected by an impact on Mars that reaccreted in Martian orbit, similar to the prevailing theory for the origin of Earth's moon."
  },
  {
    "title": "Asteroid, Size distribution",
    "content": "Asteroids vary greatly in size, from almost for the largest down to rocks just 1 meter across, below which an object is classified as a meteoroid. The three largest are very much like miniature planets: they are roughly spherical, have at least partly differentiated interiors, and are thought to be surviving protoplanets. The vast majority, however, are much smaller and are irregularly shaped; they are thought to be either battered planetesimals or fragments of larger bodies.\nThe dwarf planet Ceres is by far the largest asteroid, with a diameter of 940 km. The next largest are 4 Vesta and 2 Pallas, both with diameters of just over 500 km. Vesta is the brightest of the four main-belt asteroids that can, on occasion, be visible to the naked eye. On some rare occasions, a near-Earth asteroid may briefly become visible without technical aid; see 99942 Apophis.\nThe mass of all the objects of the asteroid belt, lying between the orbits of Mars and Jupiter, is estimated to be, ~ 3.25% of the mass of the Moon. Of this, Ceres comprises, about 40% of the total. Adding in the next three most massive objects, Vesta, Pallas, and Hygiea, brings this figure up to a bit over 60%, whereas the next seven most-massive asteroids bring the total up to 70%. The number of asteroids increases rapidly as their individual masses decrease.\nThe number of asteroids decreases markedly with increasing size. Although the size distribution generally follows a power law, there are 'bumps' at about and, where more asteroids than expected from such a curve are found. Most asteroids larger than approximately 120 km in diameter are primordial, whereas most smaller asteroids are products of fragmentation of primordial asteroids. The primordial population of the main belt was probably 200 times what it is today."
  },
  {
    "title": "Asteroid, Largest asteroids",
    "content": "\t\nThree largest objects in the asteroid belt, Ceres, Vesta, and Pallas, are intact protoplanets that share many characteristics common to planets, and are atypical compared to the majority of irregularly shaped asteroids. The fourth-largest asteroid, Hygiea, appears nearly spherical although it may have an undifferentiated interior, like the majority of asteroids. The four largest asteroids constitute half the mass of the asteroid belt.\nCeres is the only asteroid that appears to have a plastic shape under its own gravity and hence the only one that is a dwarf planet. It has a much higher absolute magnitude than the other asteroids, of around 3.32, and may possess a surface layer of ice. Like the planets, Ceres is differentiated: it has a crust, a mantle and a core. No meteorites from Ceres have been found on Earth.\nPallas is unusual in that, like Uranus, it rotates on its side, with its axis of rotation tilted at high angles to its orbital plane. Its composition is similar to that of Ceres: high in carbon and silicon, and perhaps partially differentiated. Pallas is the parent body of the Palladian family of asteroids.\nHygiea is the largest carbonaceous asteroid and, unlike the other largest asteroids, lies relatively close to the plane of the ecliptic. It is the largest member and presumed parent body of the Hygiean family of asteroids. Because there is no sufficiently large crater on the surface to be the source of that family, as there is on Vesta, it is thought that Hygiea may have been completely disrupted in the collision that formed the Hygiean family and recoalesced after losing a bit less than 2% of its mass. 
Observations taken with the Very Large Telescope's SPHERE imager in 2017 and 2018, revealed that Hygiea has a nearly spherical shape, which is consistent both with it being in hydrostatic equilibrium, or formerly being in hydrostatic equilibrium, or with being disrupted and recoalescing.\nInternal differentiation of large asteroids is possibly related to their lack of natural satellites, as satellites of main belt asteroids are mostly believed to form from collisional disruption, creating a rubble pile structure."
  },
  {
    "title": "Asteroid, Rotation",
    "content": "Measurements of the rotation rates of large asteroids in the asteroid belt show that there is an upper limit. Very few asteroids with a diameter larger than 100 meters have a rotation period less than 2.2 hours. For asteroids rotating faster than approximately this rate, the inertial force at the surface is greater than the gravitational force, so any loose surface material would be flung out. However, a solid object should be able to rotate much more rapidly. This suggests that most asteroids with a diameter over 100 meters are rubble piles formed through the accumulation of debris after collisions between asteroids."
  },
  {
    "title": "Asteroid, Surface features",
    "content": "Except for the \"big four\", asteroids are likely to be broadly similar in appearance, if irregular in shape. 50 km 253 Mathilde is a rubble pile saturated with craters with diameters the size of the asteroid's radius. Earth-based observations of 300 km 511 Davida, one of the largest asteroids after the big four, reveal a similarly angular profile, suggesting it is also saturated with radius-size craters. Medium-sized asteroids such as Mathilde and 243 Ida, that have been observed up close, also reveal a deep regolith covering the surface. Of the big four, Pallas and Hygiea are practically unknown. Vesta has compression fractures encircling a radius-size crater at its south pole but is otherwise a spheroid.\n\"Dawn spacecraft\" revealed that Ceres has a heavily cratered surface, but with fewer large craters than expected. Models based on the formation of the current asteroid belt had suggested Ceres should possess 10 to 15 craters larger than 400 km in diameter. The most likely reason for this is viscous relaxation of the crust slowly flattening out larger impacts."
  },
  {
    "title": "Asteroid, Composition",
    "content": "Asteroids are classified by their characteristic emission spectra, with the majority falling into three main groups: C-type, M-type, and S-type. These were named after and are generally identified with carbonaceous, metallic, and silicaceous compositions, respectively. The physical composition of asteroids is varied and in most cases poorly understood. Ceres appears to be composed of a rocky core covered by an icy mantle, where Vesta is thought to have a nickel-iron core, olivine mantle, and basaltic crust. Thought to be the largest undifferentiated asteroid, 10 Hygiea seems to have a uniformly primitive composition of carbonaceous chondrite, but it may actually be a differentiated asteroid that was globally disrupted by an impact and then reassembled. Other asteroids appear to be the remnant cores or mantles of proto-planets, high in rock and metal. Most small asteroids are believed to be piles of rubble held together loosely by gravity, although the largest are probably solid. Some asteroids have moons or are co-orbiting binaries: rubble piles, moons, binaries, and scattered asteroid families are thought to be the results of collisions that disrupted a parent asteroid, or possibly a planet.\nIn the main asteroid belt, there appear to be two primary populations of asteroid: a dark, volatile-rich population, consisting of the C-type and P-type asteroids, with albedos less than 0.10 and densities under, and a dense, volatile-poor population, consisting of the S-type and M-type asteroids, with albedos over 0.15 and densities greater than 2.7. Within these populations, larger asteroids are denser, presumably due to compression. There appears to be minimal macro-porosity in the score of asteroids with masses greater than.\nComposition is calculated from three primary sources: albedo, surface spectrum, and density. The last can only be determined accurately by observing the orbits of moons the asteroid might have. 
So far, every asteroid with moons has turned out to be a rubble pile, a loose conglomeration of rock and metal that may be half empty space by volume. The investigated asteroids are as large as 280 km in diameter, and include 121 Hermione, and 87 Sylvia. Few asteroids are larger than 87 Sylvia, none of them have moons. The fact that such large asteroids as Sylvia may be rubble piles, presumably due to disruptive impacts, has important consequences for the formation of the Solar System: computer simulations of collisions involving solid bodies show them destroying each other as often as merging, but colliding rubble piles are more likely to merge. This means that the cores of the planets could have formed relatively quickly."
  },
  {
    "title": "Asteroid, Water",
    "content": "Scientists hypothesize that some of the first water brought to Earth was delivered by asteroid impacts after the collision that produced the Moon. In 2009, the presence of water ice was confirmed on the surface of 24 Themis using NASA's Infrared Telescope Facility. The surface of the asteroid appears completely covered in ice. As this ice layer is sublimating, it may be getting replenished by a reservoir of ice under the surface. Organic compounds were also detected on the surface. The presence of ice on 24 Themis makes the initial theory plausible.\nIn October 2013, water was detected on an extrasolar body for the first time, on an asteroid orbiting the white dwarf GD 61. On 22 January 2014, European Space Agency scientists reported the detection, for the first definitive time, of water vapor on Ceres, the largest object in the asteroid belt. The detection was made by using the far-infrared abilities of the Herschel Space Observatory. The finding is unexpected because comets, not asteroids, are typically considered to \"sprout jets and plumes\". According to one of the scientists, \"The lines are becoming more and more blurred between comets and asteroids.\"\nFindings have shown that solar winds can react with the oxygen in the upper layer of the asteroids and create water. It has been estimated that \"every cubic metre of irradiated rock could contain up to 20 litres\"; study was conducted using an atom probe tomography, numbers are given for the Itokawa S-type asteroid.\nAcfer 049, a meteorite discovered in Algeria in 1990, was shown in 2019 to have an ultraporous lithology: porous texture that could be formed by removal of ice that filled these pores, this suggests that UPL \"represent fossils of primordial ice\"."
  },
  {
    "title": "Asteroid, Organic compounds",
    "content": "Asteroids contain traces of amino acids and other organic compounds, and some speculate that asteroid impacts may have seeded the early Earth with the chemicals necessary to initiate life, or may have even brought life itself to Earth. In August 2011, a report, based on NASA studies with meteorites found on Earth, was published suggesting DNA and RNA components may have been formed on asteroids and comets in outer space.\nIn November 2019, scientists reported detecting, for the first time, sugar molecules, including ribose, in meteorites, suggesting that chemical processes on asteroids can produce some fundamentally essential bio-ingredients important to life, and supporting the notion of an RNA world prior to a DNA-based origin of life on Earth, and possibly, as well, the notion of panspermia."
  },
  {
    "title": "Asteroid, Orbital classification",
    "content": "Many asteroids have been placed in groups and families based on their orbital characteristics. Apart from the broadest divisions, it is customary to name a group of asteroids after the first member of that group to be discovered. Groups are relatively loose dynamical associations, whereas families are tighter and result from the catastrophic break-up of a large parent asteroid sometime in the past. Families are more common and easier to identify within the main asteroid belt, but several small families have been reported among the Jupiter trojans. Main belt families were first recognized by Kiyotsugu Hirayama in 1918 and are often called Hirayama families in his honor.\nAbout 30-35% of the bodies in the asteroid belt belong to dynamical families, each thought to have a common origin in a past collision between asteroids. A family has also been associated with the plutoid dwarf planet.\nSome asteroids have unusual horseshoe orbits that are co-orbital with Earth or another planet. Examples are 3753 Cruithne and. The first instance of this type of orbital arrangement was discovered between Saturn's moons Epimetheus and Janus. Sometimes these horseshoe objects temporarily become quasi-satellites for a few decades or a few hundred years, before returning to their earlier status. Both Earth and Venus are known to have quasi-satellites.\nSuch objects, if associated with Earth or Venus or even hypothetically Mercury, are a special class of Aten asteroids. However, such objects could be associated with the outer planets as well."
  },
  {
    "title": "Asteroid, Spectral classification",
    "content": "In 1975, an asteroid taxonomic system based on color, albedo, and spectral shape was developed by Chapman, Morrison, and Zellner. These properties are thought to correspond to the composition of the asteroid's surface material. The original classification system had three categories: C-types for dark carbonaceous objects, S-types for stony objects and U for those that did not fit into either C or S. This classification has since been expanded to include many other asteroid types. The number of types continues to grow as more asteroids are studied.\nThe two most widely used taxonomies now used are the Tholen classification and SMASS classification. The former was proposed in 1984 by David J. Tholen, and was based on data collected from an eight-color asteroid survey performed in the 1980s. This resulted in 14 asteroid categories. In 2002, the Small Main-Belt Asteroid Spectroscopic Survey resulted in a modified version of the Tholen taxonomy with 24 different types. Both systems have three broad categories of C, S, and X asteroids, where X consists of mostly metallic asteroids, such as the M-type. There are also several smaller classes.\nThe proportion of known asteroids falling into the various spectral types does not necessarily reflect the proportion of all asteroids that are of that type; some types are easier to detect than others, biasing the totals."
  },
  {
    "title": "Asteroid, Problems",
    "content": "Originally, spectral designations were based on inferences of an asteroid's composition. However, the correspondence between spectral class and composition is not always very good, and a variety of classifications are in use. This has led to significant confusion. Although asteroids of different spectral classifications are likely to be composed of different materials, there are no assurances that asteroids within the same taxonomic class are composed of the same materials."
  },
  {
    "title": "Asteroid, Active asteroids",
    "content": "Active asteroids are objects that have asteroid-like orbits but show comet-like visual characteristics. That is, they show comae, tails, or other visual evidence of mass-loss, but their orbit remains within Jupiter's orbit. These bodies were originally designated main-belt comets in 2006 by astronomers David Jewitt and Henry Hsieh, but this name implies they are necessarily icy in composition like a comet and that they only exist within the main-belt, whereas the growing population of active asteroids shows that this is not always the case.\nThe first active asteroid discovered is 7968 Elst-Pizarro. It was discovered in 1979 but then was found to have a tail by Eric Elst and Guido Pizarro in 1996 and given the cometary designation 133P/Elst-Pizarro. Another notable object is 311P/PanSTARRS: observations made by the Hubble Space Telescope revealed that it had six comet-like tails. The tails are suspected to be streams of material ejected by the asteroid as a result of a rubble pile asteroid spinning fast enough to remove material from it.\nBy smashing into the asteroid Dimorphos, NASA's Double Asteroid Redirection Test spacecraft made it an active asteroid. Scientists had proposed that some active asteroids are the result of impact events, but no one had ever observed the activation of an asteroid. The DART mission activated Dimorphos under precisely known and carefully observed impact conditions, enabling the detailed study of the formation of an active asteroid for the first time. Observations show that Dimorphos lost approximately 1 million kilograms after the collision. Impact produced a dust plume that temporarily brightened the Didymos system and developed a 10000 km-long dust tail that persisted for several months."
  },
  {
    "title": "Asteroid, Exploration",
    "content": "Until the age of space travel, objects in the asteroid belt could only be observed with large telescopes, their shapes and terrain remaining a mystery. The best modern ground-based telescopes and the Earth-orbiting Hubble Space Telescope can only resolve a small amount of detail on the surfaces of the largest asteroids. Limited information about the shapes and compositions of asteroids can be inferred from their light curves and their spectral properties. Sizes can be estimated by timing the lengths of star occultations. Radar imaging can yield good information about asteroid shapes and orbital and rotational parameters, especially for near-Earth asteroids. Spacecraft flybys can provide much more data than any ground or space-based observations; sample-return missions gives insights about regolith composition."
  },
  {
    "title": "Asteroid, Ground-based observations",
    "content": "Mid- to thermal-infrared observations, along with polarimetry measurements, are probably the only data that give some indication of actual physical properties. Measuring the heat flux of an asteroid at a single wavelength gives an estimate of the dimensions of the object; these measurements have lower uncertainty than measurements of the reflected sunlight in the visible-light spectral region. If the two measurements can be combined, both the effective diameter and the geometric albedo-the latter being a measure of the brightness at zero phase angle, that is, when illumination comes from directly behind the observer-can be derived. In addition, thermal measurements at two or more wavelengths, plus the brightness in the visible-light region, give information on the thermal properties. The thermal inertia, which is a measure of how fast a material heats up or cools off, of most observed asteroids is lower than the bare-rock reference value but greater than that of the lunar regolith; this observation indicates the presence of an insulating layer of granular material on their surface. Moreover, there seems to be a trend, perhaps related to the gravitational environment, that smaller objects have a small regolith layer consisting of coarse grains, while larger objects have a thicker regolith layer consisting of fine grains. However, the detailed properties of this regolith layer are poorly known from remote observations. Moreover, the relation between thermal inertia and surface roughness is not straightforward, so one needs to interpret the thermal inertia with caution.\nNear-Earth asteroids that come into close vicinity of the planet can be studied in more details with radar; it provides information about the surface of the asteroid. Such observations were conducted by the Arecibo Observatory in Puerto Rico and Goldstone Observatory in California. 
Radar observations can also be used for accurate determination of the orbital and rotational dynamics of observed objects."
  },
  {
    "title": "Asteroid, Space-based observations",
    "content": "Both space and ground-based observatories conducted asteroid search programs; the space-based searches are expected to detect more objects because there is no atmosphere to interfere and because they can observe larger portions of the sky. NEOWISE observed more than 100,000 asteroids of the main belt, Spitzer Space Telescope observed more than 700 near-Earth asteroids. These observations determined rough sizes of the majority of observed objects, but provided limited detail about surface properties.\nAsteroids were also studied by the Hubble Space Telescope, such as tracking the colliding asteroids in the main belt, break-up of an asteroid, observing an active asteroid with six comet-like tails, and observing asteroids that were chosen as targets of dedicated missions."
  },
  {
    "title": "Asteroid, Space probe missions",
    "content": "The internal structure of asteroids is inferred only from indirect evidence: bulk densities measured by spacecraft, the orbits of natural satellites in the case of asteroid binaries, and the drift of an asteroid's orbit due to the Yarkovsky thermal effect. A spacecraft near an asteroid is perturbed enough by the asteroid's gravity to allow an estimate of the asteroid's mass. The volume is then estimated using a model of the asteroid's shape. Mass and volume allow the derivation of the bulk density, whose uncertainty is usually dominated by the errors made on the volume estimate. The internal porosity of asteroids can be inferred by comparing their bulk density with that of their assumed meteorite analogues, dark asteroids seem to be more porous than bright ones. The nature of this porosity is unclear."
  },
  {
    "title": "Asteroid, Dedicated missions",
    "content": "The first asteroid to be photographed in close-up was 951 Gaspra in 1991, followed in 1993 by 243 Ida and its moon Dactyl, all of which were imaged by the \"Galileo\" probe en route to Jupiter. Other asteroids briefly visited by spacecraft en route to other destinations include 9969 Braille, 5535 Annefrank, 2867 Šteins and 21 Lutetia, and 4179 Toutatis.\nThe first dedicated asteroid probe was NASA's \"NEAR Shoemaker\", which photographed 253 Mathilde in 1997, before entering into orbit around 433 Eros, finally landing on its surface in 2001. It was the first spacecraft to successfully orbit and land on an asteroid. From September to November 2005, the Japanese \"Hayabusa\" probe studied 25143 Itokawa in detail and returned samples of its surface to Earth on 13 June 2010, the first asteroid sample-return mission. In 2007, NASA launched the \"Dawn\" spacecraft, which orbited 4 Vesta for a year, and observed the dwarf planet Ceres for three years.\n\"Hayabusa2\", a probe launched by JAXA 2014, orbited its target asteroid 162173 Ryugu for more than a year and took samples that were delivered to Earth in 2020. The spacecraft is now on an extended mission and expected to arrive at a new target in 2031.\nNASA launched the OSIRIS-REx in 2016, a sample return mission to asteroid 101955 Bennu. In 2021, the probe departed the asteroid with a sample from its surface. Sample delivery to Earth is expected on September 24, 2023. The spacecraft will continue on an extended mission, designated OSIRIS-APEX, to explore near-Earth asteroid Apophis in 2029.\nIn 2021, NASA launched Double Asteroid Redirection Test, a mission to test technology for defending Earth against potential hazardous objects. DART deliberately crashed into the minor-planet moon Dimorphos of the double asteroid Didymos in September 2022 to assess the potential of a spacecraft impact to deflect an asteroid from a collision course with Earth. 
In October, NASA declared DART a success, confirming it had shortened Dimorphos' orbital period around Didymos by about 32 minutes."
  },
  {
    "title": "Asteroid, Planned missions",
    "content": "Currently, several asteroid-dedicated missions are planned by NASA, JAXA, ESA, and CNSA.\nNASA's \"Lucy\", launched in 2021, would visit eight asteroids, one from the main belt and seven Jupiter trojans; it is the first mission to trojans. The main mission would start in 2027.\nESA's \"Hera\", planned for launch in 2024, will study the results of the DART impact. It will measure the size and morphology of the crater, and momentum transmitted by the impact, to determine the efficiency of the deflection produced by DART.\nNASA's \"Psyche\" would be launched in 2023 or 2024 to study the large metallic asteroid of the same name.\nJAXA's DESTINY+ is a mission for a flyby of the Geminids meteor shower parent body 3200 Phaethon, as well as various minor bodies. Its launch is planned for 2024.\nCNSA's \"Tianwen-2\" is planned to launch in 2025. It will use solar electric propulsion to explore the co-orbital near-Earth asteroid 469219 Kamo'oalewa and the active asteroid 311P/PanSTARRS. The spacecraft will collect samples of the regolith of Kamo'oalewa."
  },
  {
    "title": "Asteroid, Asteroid mining",
    "content": "The concept of asteroid mining was proposed in 1970s. Matt Anderson defines successful asteroid mining as \"the development of a mining program that is both financially self-sustaining and profitable to its investors\". It has been suggested that asteroids might be used as a source of materials that may be rare or exhausted on Earth, or materials for constructing space habitats. Materials that are heavy and expensive to launch from Earth may someday be mined from asteroids and used for space manufacturing and construction.\nAs resource depletion on Earth becomes more real, the idea of extracting valuable elements from asteroids and returning these to Earth for profit, or using space-based resources to build solar-power satellites and space habitats, becomes more attractive. Hypothetically, water processed from ice could refuel orbiting propellant depots.\nFrom the astrobiological perspective, asteroid prospecting could provide scientific data for the search for extraterrestrial intelligence. Some astrophysicists have suggested that if advanced extraterrestrial civilizations employed asteroid mining long ago, the hallmarks of these activities might be detectable.\nMining Ceres is also considered a possibility. As the largest body in the asteroid belt, Ceres could become the main base and transport hub for future asteroid mining infrastructure, allowing mineral resources to be transported to Mars, the Moon, and Earth. Because of its small escape velocity combined with large amounts of water ice, it also could serve as a source of water, fuel, and oxygen for ships going through and beyond the asteroid belt."
  },
  {
    "title": "Asteroid, Threats to Earth",
    "content": "There is increasing interest in identifying asteroids whose orbits cross Earth's, and that could, given enough time, collide with Earth. The three most important groups of near-Earth asteroids are the Apollos, Amors, and Atens.\nThe near-Earth asteroid 433 Eros had been discovered as long ago as 1898, and the 1930s brought a flurry of similar objects. In order of discovery, these were: 1221 Amor, 1862 Apollo, 2101 Adonis, and finally 69230 Hermes, which approached within 0.005 AU of Earth in 1937. Astronomers began to realize the possibilities of Earth impact.\nTwo events in later decades increased the alarm: the increasing acceptance of the Alvarez hypothesis that an impact event resulted in the Cretaceous-Paleogene extinction, and the 1994 observation of Comet Shoemaker-Levy 9 crashing into Jupiter. The U.S. military also declassified the information that its military satellites, built to detect nuclear explosions, had detected hundreds of upper-atmosphere impacts by objects ranging from one to ten meters across.\nAll of these considerations helped spur the launch of highly efficient surveys, consisting of charge-coupled device cameras and computers directly connected to telescopes., it was estimated that 89% to 96% of near-Earth asteroids one kilometer or larger in diameter had been discovered.\nIn April 2018, the B612 Foundation reported \"It is 100 percent certain we'll be hit by a devastating asteroid, but we're not 100 percent sure when\". In June 2018, the US National Science and Technology Council warned that America is unprepared for an asteroid impact event, and has developed and released the \"National Near-Earth Object Preparedness Strategy Action Plan\" to better prepare. 
According to expert testimony in the United States Congress in 2013, NASA would require at least five years of preparation before a mission to intercept an asteroid could be launched.\nThe United Nations declared 30 June as International Asteroid Day to educate the public about asteroids. The date of International Asteroid Day commemorates the anniversary of the Tunguska asteroid impact over Siberia, on 30 June 1908."
  },
  {
    "title": "Asteroid, Chicxulub impact",
    "content": "The Chicxulub crater is an impact crater buried underneath the Yucatán Peninsula in Mexico. Its center is offshore near the communities of Chicxulub Puerto and Chicxulub Pueblo, after which the crater is named. It was formed when a large asteroid, about 10 km in diameter, struck the Earth. The crater is estimated to be 180 km in diameter and 20 km in depth. It is one of the largest confirmed impact structures on Earth, and the only one whose peak ring is intact and directly accessible for scientific research.\nIn the late 1970s, geologist Walter Alvarez and his father, Nobel Prize-winning scientist Luis Walter Alvarez, put forth their theory that the Cretaceous-Paleogene extinction was caused by an impact event. The main evidence of such an impact was contained in a thin layer of clay present in the K-Pg boundary in Gubbio, Italy. The Alvarezes and colleagues reported that it contained an abnormally high concentration of iridium, a chemical element rare on earth but common in asteroids. Iridium levels in this layer were as much as 160 times above the background level. At the time, consensus was not settled on what caused the Cretaceous-Paleogene extinction and the boundary layer, with theories including a nearby supernova, climate change, or a geomagnetic reversal. The Alvarezes' impact hypothesis was rejected by many paleontologists, who believed that the lack of fossils found close to the K-Pg boundary-the \"three-meter problem\"-suggested a more gradual die-off of fossil species."
  },
  {
    "title": "Asteroid, Asteroid deflection strategies",
    "content": "Various collision avoidance techniques have different trade-offs with respect to metrics such as overall performance, cost, failure risks, operations, and technology readiness. There are various methods for changing the course of an asteroid/comet. These can be differentiated by various types of attributes such as the type of mitigation, energy source, and approach strategy.\nStrategies fall into two basic sets: fragmentation and delay. Fragmentation concentrates on rendering the impactor harmless by fragmenting it and scattering the fragments so that they miss the Earth or are small enough to burn up in the atmosphere. Delay exploits the fact that both the Earth and the impactor are in orbit. An impact occurs when both reach the same point in space at the same time, or more correctly when some point on Earth's surface intersects the impactor's orbit when the impactor arrives. Since the Earth is approximately 12,750 km in diameter and moves at approx. 30 km per second in its orbit, it travels a distance of one planetary diameter in about 425 seconds, or slightly over seven minutes. Delaying, or advancing the impactor's arrival by times of this magnitude can, depending on the exact geometry of the impact, cause it to miss the Earth.\n\"Project Icarus\" was one of the first projects designed in 1967 as a contingency plan in case of collision with 1566 Icarus. The plan relied on the new Saturn V rocket, which did not make its first flight until after the report had been completed. Six Saturn V rockets would be used, each launched at variable intervals from months to hours away from impact. Each rocket was to be fitted with a single 100-megaton nuclear warhead as well as a modified Apollo Service Module and uncrewed Apollo Command Module for guidance to the target. The warheads would be detonated 30 meters from the surface, deflecting or partially destroying the asteroid. 
Depending on the subsequent impacts on the course or the destruction of the asteroid, later missions would be modified or cancelled as needed. The \"last-ditch\" launch of the sixth rocket would be 18 hours prior to impact."
  },
  {
    "title": "Asteroid, Fiction",
    "content": "Asteroids and the asteroid belt are a staple of science fiction stories. Asteroids play several potential roles in science fiction: as places human beings might colonize, resources for extracting minerals, hazards encountered by spacecraft traveling between two other points, and as a threat to life on Earth or other inhabited planets, dwarf planets, and natural satellites by potential impact."
  }
]