
Posit AI Blog: Training ImageNet with R

ImageNet (Deng et al. 2009) is an image database organized according to the WordNet (Miller 1995) hierarchy which, historically, has been used in computer vision benchmarks and research. However, it was not until AlexNet (Krizhevsky, Sutskever, and Hinton 2012) demonstrated the efficiency of deep learning using convolutional neural networks on GPUs that the computer-vision discipline turned to deep learning to achieve state-of-the-art models that revolutionized the field. Given the importance of ImageNet and AlexNet, this post introduces tools and techniques to consider when training ImageNet and other large-scale datasets with R.

Now, in order to process ImageNet, we will first need to divide and conquer, partitioning the dataset into several manageable subsets. Afterwards, we will train ImageNet using AlexNet across multiple GPUs and compute instances. Preprocessing ImageNet and distributed training are the two topics this post presents and discusses, starting with preprocessing ImageNet.

Preprocessing ImageNet

When dealing with large datasets, even simple tasks like downloading or reading a dataset can be much harder than you would expect. For instance, since ImageNet is roughly 300GB in size, you will need to make sure you have at least 600GB of free space to leave some room for download and decompression. But no worries, you can always borrow computers with large disk drives from your favorite cloud provider. While you are at it, you should also request compute instances with multiple GPUs, Solid State Drives (SSDs), and a reasonable amount of CPUs and memory. If you want to use the exact configuration we used, take a look at the mlverse/imagenet repo, which contains a Docker image and the configuration commands required to provision reasonable computing resources for this task. In summary, make sure you have access to sufficient compute resources.

Now that we have resources capable of working with ImageNet, we need to find a place to download ImageNet from. The easiest way is to use a variation of ImageNet used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which contains a subset of about 250GB of data and can be easily downloaded from many Kaggle competitions, like the ImageNet Object Localization Challenge.

If you've read some of our previous posts, you might already be thinking of using the pins package, which you can use to cache, discover, and share resources from many services, including Kaggle. You can learn more about data retrieval from Kaggle in the Using Kaggle Boards article; in the meantime, let's assume you are already familiar with this package.

All we need to do now is register the Kaggle board, retrieve ImageNet as a pin, and decompress the file. Warning: the following code requires you to stare at a progress bar for, potentially, over an hour.

library(pins)
board_register("kaggle", token = "kaggle.json")

pin_get("c/imagenet-object-localization-challenge", board = "kaggle")[1] %>%
  untar(exdir = "/localssd/imagenet/")

If we are going to be training this model over and over using multiple GPUs and even multiple compute instances, we want to make sure we don't waste too much time downloading ImageNet every single time.

The first improvement to consider is getting a faster hard drive. In our case, we locally mounted an array of SSDs into the /localssd path. We then used /localssd to extract ImageNet and configured R's temp path and the pins cache to use the SSDs as well. Consult your cloud provider's documentation to configure SSDs, or take a look at mlverse/imagenet.
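If it helps, here is a minimal sketch of that configuration, assuming /localssd is already mounted; the cache argument of board_register() and the TMPDIR environment variable (which R reads at startup, for instance from .Renviron) are the relevant knobs, and the exact paths will depend on your setup:

# Sketch: point the pins cache at the SSD array when registering the board.
library(pins)
board_register("kaggle", token = "kaggle.json", cache = "/localssd/pins")

# R's temp path is controlled by TMPDIR, which must be set before the session
# starts (for example, TMPDIR=/localssd/tmp in .Renviron); tempdir() then
# reports the SSD-backed location.
tempdir()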

Next, a well-known approach we can follow is to partition ImageNet into chunks that can be individually downloaded to perform distributed training later on.

In addition, it is also faster to download ImageNet from a nearby location, ideally from a URL stored within the same data center where our cloud instance is located. For this, we can also use pins to register a board with our cloud provider and then re-upload each partition. Since ImageNet is already partitioned by category, we can easily split ImageNet into multiple zip files and re-upload them to our closest data center as follows. Make sure the storage bucket is created in the same region as your compute instances.

board_register("", name = "imagenet", bucket = "r-imagenet")

train_path <- "/localssd/imagenet/ILSVRC/Data/CLS-LOC/train/"
for (path in dir(train_path, full.names = TRUE)) {
  dir(path, full.names = TRUE) %>%
    pin(name = basename(path), board = "imagenet", zip = TRUE)
}

We can now retrieve a subset of ImageNet quite efficiently. If you are motivated to do so and have about one gigabyte to spare, feel free to follow along and execute this code. Notice that ImageNet contains lots of JPEG images for each WordNet category.

board_register("https://storage.googleapis.com/r-imagenet/", "imagenet")

categories <- pin_get("categories", board = "imagenet")
pin_get(categories$id[1], board = "imagenet", extract = TRUE) %>%
  tibble::as_tibble()
# A tibble: 1,300 x 1
   value                                                 
   <chr>                                                 
 1 /localssd/pins/storage/n01440764/n01440764_10026.JPEG
 2 /localssd/pins/storage/n01440764/n01440764_10027.JPEG
 3 /localssd/pins/storage/n01440764/n01440764_10029.JPEG
 4 /localssd/pins/storage/n01440764/n01440764_10040.JPEG
 5 /localssd/pins/storage/n01440764/n01440764_10042.JPEG
 6 /localssd/pins/storage/n01440764/n01440764_10043.JPEG
 7 /localssd/pins/storage/n01440764/n01440764_10048.JPEG
 8 /localssd/pins/storage/n01440764/n01440764_10066.JPEG
 9 /localssd/pins/storage/n01440764/n01440764_10074.JPEG
10 /localssd/pins/storage/n01440764/n01440764_1009.JPEG 
# … with 1,290 more rows

When doing distributed training over ImageNet, we can now let a single compute instance process a partition of ImageNet with ease. Say, 1/16 of ImageNet can be retrieved and extracted, in under a minute, using parallel downloads with the callr package:

categories <- pin_get("categories", board = "imagenet")
categories <- categories$id[1:(length(categories$id) / 16)]

procs <- lapply(categories, function(cat)
  callr::r_bg(function(cat) {
    library(pins)
    board_register("https://storage.googleapis.com/r-imagenet/", "imagenet")
    
    pin_get(cat, board = "imagenet", extract = TRUE)
  }, args = list(cat))
)
  
while (any(sapply(procs, function(p) p$is_alive()))) Sys.sleep(1)

We can wrap up this partition in a list containing a map of images and categories, which we will later use in our AlexNet model through tfdatasets.

data <- list(
    image = unlist(lapply(categories, function(cat) {
        pin_get(cat, board = "imagenet", download = FALSE)
    })),
    category = unlist(lapply(categories, function(cat) {
        rep(cat, length(pin_get(cat, board = "imagenet", download = FALSE)))
    })),
    categories = categories
)

Great! We are halfway to training ImageNet. The next section will focus on introducing distributed training using multiple GPUs.

Distributed Training

Now that we have broken down ImageNet into manageable parts, we can forget for a second about the size of ImageNet and focus on training a deep learning model for this dataset. However, any model we choose is likely to require a GPU, even for a 1/16 subset of ImageNet. So make sure your GPUs are properly configured by running is_gpu_available(). If you need help getting a GPU configured, the Using GPUs with TensorFlow and Docker video can help you get up to speed.
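For reference, here is one way to run that check from the R console (a minimal sketch using the TensorFlow API exposed by the tensorflow package):

library(tensorflow)
tf$test$is_gpu_available()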

[1] TRUE

We could now decide which deep learning model would best be suited for ImageNet classification tasks. Instead, for this post, we will go back in time to the glory days of AlexNet and use the r-tensorflow/alexnet repo. This repo contains a port of AlexNet to R, but please note that this port has not been tested and is not ready for any real use cases. In fact, we would appreciate PRs to improve it if someone feels inclined to do so. Regardless, the focus of this post is on workflows and tools, not on achieving state-of-the-art image classification scores. So by all means, feel free to use more appropriate models.

Once we've chosen a model, we want to make sure that it properly trains on a subset of ImageNet:

remotes::install_github("r-tensorflow/alexnet")
alexnet::alexnet_train(data = data)
Epoch 1/2
 103/2269 [>...............] - ETA: 5:52 - loss: 72306.4531 - accuracy: 0.9748

So far so good! However, this post is about enabling large-scale training across multiple GPUs, so we want to make sure we are using as many of them as we can. Unfortunately, running nvidia-smi will show that only one GPU is currently being used:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.152.00   Driver Version: 418.152.00   CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:05.0 Off |                    0 |
| N/A   48C    P0    89W / 149W |  10935MiB / 11441MiB |     28%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:00:06.0 Off |                    0 |
| N/A   74C    P0    74W / 149W |     71MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

In order to train across multiple GPUs, we need to define a distributed-processing strategy. If this is a new concept, it might be a good time to take a look at the Distributed Training with Keras tutorial and the distributed training with TensorFlow docs. Or, if you allow us to oversimplify the process, all you have to do is define and compile your model under the right scope. A step-by-step explanation is available in the Distributed Deep Learning with TensorFlow and R video. In this case, the alexnet model already supports a strategy parameter, so all we have to do is pass it along.

library(tensorflow)
strategy <- tf$distribute$MirroredStrategy(
  cross_device_ops = tf$distribute$ReductionToOneDevice())

alexnet::alexnet_train(data = data, strategy = strategy, parallel = 6)

Notice also parallel = 6, which configures tfdatasets to make use of multiple CPUs when loading data into our GPUs; see Parallel Mapping for details.
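As a rough illustration of what that parallelism means in tfdatasets terms (a sketch only, not the actual alexnet internals), image decoding can be spread across several CPU threads with num_parallel_calls:

library(tensorflow)
library(tfdatasets)

# Sketch: decode JPEGs with up to 6 CPU threads while the GPUs train.
dataset <- tensor_slices_dataset(data$image) %>%
  dataset_map(function(path) {
    tf$image$decode_jpeg(tf$io$read_file(path), channels = 3L)
  }, num_parallel_calls = 6L)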

We can now re-run nvidia-smi to validate that all of our GPUs are being used:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.152.00   Driver Version: 418.152.00   CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:05.0 Off |                    0 |
| N/A   49C    P0    94W / 149W |  10936MiB / 11441MiB |     53%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:00:06.0 Off |                    0 |
| N/A   76C    P0   114W / 149W |  10936MiB / 11441MiB |     26%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

The MirroredStrategy can help us scale up to about 8 GPUs per compute instance; however, we are likely to need 16 instances with 8 GPUs each to train ImageNet in a reasonable amount of time (see Jeremy Howard's post on Training Imagenet in 18 Minutes). So where do we go from here?

Welcome to MultiWorkerMirroredStrategy: this strategy can use not only multiple GPUs, but also multiple GPUs across multiple computers. To configure them, all we have to do is define a TF_CONFIG environment variable with the right addresses and run the exact same code in each compute instance.

library(tensorflow)

partition <- 0
Sys.setenv(TF_CONFIG = jsonlite::toJSON(list(
    cluster = list(
        worker = c("10.100.10.100:10090", "10.100.10.101:10090")
    ),
    task = list(type = 'worker', index = partition)
), auto_unbox = TRUE))

strategy <- tf$distribute$MultiWorkerMirroredStrategy(
  cross_device_ops = tf$distribute$ReductionToOneDevice())

alexnet::imagenet_partition(partition = partition) %>%
  alexnet::alexnet_train(strategy = strategy, parallel = 6)

Please note that partition must change for each compute instance to uniquely identify it, and that the IP addresses also need to be adjusted. In addition, data should point to a different partition of ImageNet, which we can retrieve with pins; although, for convenience, alexnet contains similar code under alexnet::imagenet_partition(). Other than that, the code that you need to run in each compute instance is exactly the same.
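As a rough, hypothetical sketch of what such a helper needs to do (the actual alexnet::imagenet_partition() implementation may differ), each worker can select its own slice of the category pins and rebuild the data list from it:

# Hypothetical helper: pick this worker's slice of the categories (one of 16)
# and rebuild the image/category list expected by alexnet_train().
get_partition <- function(partition, workers = 16) {
  categories <- pin_get("categories", board = "imagenet")$id
  slice <- split(categories, cut(seq_along(categories), workers, labels = FALSE))[[partition + 1]]
  for (cat in slice) pin_get(cat, board = "imagenet", extract = TRUE)
  list(
    image = unlist(lapply(slice, function(cat)
      pin_get(cat, board = "imagenet", download = FALSE))),
    category = unlist(lapply(slice, function(cat)
      rep(cat, length(pin_get(cat, board = "imagenet", download = FALSE))))),
    categories = slice
  )
}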

However, if we were to use 16 machines with 8 GPUs each to train ImageNet, it would be quite time-consuming and error-prone to manually run code in each R session. So instead, we should consider making use of cluster-computing frameworks, like Apache Spark with barrier execution. If you are new to Spark, there are many resources available at sparklyr.ai. To learn about running Spark and TensorFlow together, watch our Deep Learning with Spark, TensorFlow and R video.

Putting it all together, training ImageNet in R with TensorFlow and Spark looks as follows:

library(sparklyr)
sc <- spark_connect("yarn|mesos|etc", config = list("sparklyr.shell.num-executors" = 16))

sdf_len(sc, 16, repartition = 16) %>%
  spark_apply(function(df, barrier) {
      library(tensorflow)

      Sys.setenv(TF_CONFIG = jsonlite::toJSON(list(
        cluster = list(
          worker = paste(
            gsub(":[0-9]+$", "", barrier$address),
            8000 + seq_along(barrier$address), sep = ":")),
        task = list(type = 'worker', index = barrier$partition)
      ), auto_unbox = TRUE))
      
      if (is.null(tf_version())) install_tensorflow()
      
      strategy <- tf$distribute$MultiWorkerMirroredStrategy()
    
      result <- alexnet::imagenet_partition(partition = barrier$partition) %>%
        alexnet::alexnet_train(strategy = strategy, epochs = 10, parallel = 6)
      
      result$metrics$accuracy
  }, barrier = TRUE, columns = c(accuracy = "numeric"))

We hope this post gave you a reasonable overview of what training large datasets in R looks like. Thanks for reading along!

Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. "ImageNet: A Large-Scale Hierarchical Image Database." In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–55. IEEE.

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. 2012. "ImageNet Classification with Deep Convolutional Neural Networks." In Advances in Neural Information Processing Systems, 1097–1105.

Miller, George A. 1995. “WordNet: A Lexical Database for English.” Communications of the ACM 38 (11): 39–41.

Minneapolis shooting: What to know about the death of Alex Pretti



On Saturday, a Border Patrol agent in Minneapolis shot and killed Alex Jeffrey Pretti at close range after Pretti had been pepper-sprayed, beaten, and forced onto his knees by other agents.

Pretti, 37, was a US citizen and was reportedly in the area to observe agents' actions. He was also a registered nurse and a legal gun owner with a permit to carry a weapon, one that he was no longer in possession of when he was shot to death.

Pretti's death is at least the third shooting by immigration agents in the Minneapolis area this year, and the second in which the person who was shot died.

The shootings have understandably attracted the most attention nationwide. But since the immigration crackdown in Minneapolis began in early January, there have been widespread abuses of power by US Immigration and Customs Enforcement (ICE) and Customs and Border Protection (CBP) agents, including the use of chemical crowd control like pepper spray and tear gas; brutality toward protesters, bystanders, and immigrants; and baseless and often inflammatory arrests and detentions.

On January 7, just days into an immigration crackdown targeting the Minneapolis area that Trump officials heralded as the "largest immigration operation ever," an ICE agent, Jonathan Ross, shot and killed Renee Good as she tried to drive away.

The White House, Homeland Security Secretary Kristi Noem, and other federal officials quickly backed Ross to the hilt, describing Good as a domestic terrorist and the shooting as justified, despite video evidence to the contrary.

Since then, the message behind the administration's support for Ross and the shooting seems to have been clearly received by ICE agents in Minnesota, who have behaved far more like an occupying force than a law enforcement operation: not only have local officials pleaded with them to leave the state, they are also operating from behind masks and with militarized force, including tactical gear, riot-control agents, and assault weapons.

Saturday in Minneapolis.
Kerem Yucel/AFP via Getty Images

They have even pitted themselves against local police: a Minneapolis-area police chief said earlier this week that some of his off-duty officers have been harassed and racially profiled by immigration agents.

In several cases, federal agents have been documented using Good's killing as a threat against other observers documenting their actions, asking one woman, "Have y'all not learned?" before grabbing her phone and detaining her.

What immigration agents have been doing in Minneapolis

Other incidents are too numerous to tally in full, but several stand out.

Last week, federal agents violently detained two Target employees, both of whom a Minnesota state representative said were US citizens and who were later released. At least one of the employees was left in a nearby parking lot with injuries.

In another incident, a US citizen was dragged from her car by federal agents after she was stopped on the way to a doctor's appointment; agents broke the windows of her vehicle and carried her hanging face down by her arms and legs. And federal agents have been recorded pepper-spraying an already-detained man in the face at close range.


ICE agents detain a woman after pulling her from a car on January 13, 2026, in Minneapolis.
Stephen Maturen/Getty Images

A Minneapolis family was also caught up and brutalized by federal agents last week: on the way home from a basketball game, a family of eight, including a 6-month-old and five other children, was tear-gassed inside their vehicle by federal agents. All survived, but the 6-month-old required CPR.

The second of three shootings by federal immigration agents in the Minneapolis area was also a case of mistaken identity: ICE agents shot a Venezuelan man in the leg, wounding him, even though he was not their original target.

More recently, ChongLy "Scott" Thao, also a US citizen, was detained in his home at gunpoint by federal agents and taken away in sub-freezing temperatures wearing only his underwear, sandals, and a blanket. Thao was arrested without a warrant and ultimately released hours later, without an apology for his detention or for the damage to his home, Thao said.


A person is pinned to the ground by federal agents and a chemical irritant is sprayed directly into his face on January 21, 2026, in Minneapolis.
Richard Tsong-Taatarii/The Minnesota Star Tribune via Getty Images

Thao's detention is part of a larger pattern in Minneapolis, where ICE agents are increasingly acting in violation of the Fourth Amendment, which protects against unreasonable searches and seizures. As my colleague Eric Levitz wrote on Friday, ICE has decided, according to a closely held internal memo first obtained by the Associated Press, that it can enter homes with only an administrative warrant, rather than a judicial warrant. Such administrative warrants don't require a judge's approval and can be issued by ICE agents themselves.

ICE's crackdown has also swept up children in the Minneapolis area, including an incident this week in which agents tried to use a 5-year-old child as "bait" to detain others by having him knock on the door of his home after taking his father into custody, according to officials at a Minneapolis-area school district. Agents also detained a 2-year-old and her father on Thursday and briefly removed both of them to Texas.

Local publications like the Minneapolis Star-Tribune, along with bystanders filming interactions, as Pretti appeared to have been doing before he was shot and killed on Saturday, have created a more comprehensive record of ICE and CBP's actions in the state. But even this relatively limited number of incidents reveals a clear pattern of unchecked aggression and ongoing escalation by agents.

"How many more residents, how many more Americans have to die or get badly hurt for this operation to end?" Minneapolis Mayor Jacob Frey asked on Saturday. But for the Trump administration, it's not clear these deaths are much of a problem at all.



‘In Botanical Time’ explores the ways Earth’s oldest plants cheat death


In Botanical Time
Christopher Woods
Chelsea Green, $40.00

On a talus-strewn slope in eastern California's mountains, a gnarled tree twists toward the sky. It's Methuselah, a Great Basin bristlecone pine (Pinus longaeva) and one of the world's oldest trees. At over 4,800 years old, Methuselah germinated several hundred years before Imhotep began constructing ancient Egypt's first pyramid.

It's difficult to fathom such a long life span when humans live mere decades. But author and garden expert Christopher Woods' new book In Botanical Time helps readers do just that, telling the life stories of millennia-old plants and unpacking the science behind their longevity along the way.

One secret to longevity is to slow down growth, Woods writes. That has helped many ancient plants survive in less-than-ideal environments. For example, growing about 2.5 centimeters per century allows Methuselah to focus its energy on surviving frigid temperatures, nutrient-poor soil and howling winds. Accumulating genetic changes that confer traits like disease resistance has also helped.

Other ancient plants take a different approach to growth: cloning. Clonal plants create copies of themselves, often via their roots, allowing them to reach remarkable ages even after the original iteration dies.

Woods describes one Norway spruce (Picea abies) in Sweden that has cloned itself for 9,500 years, sprouting a new trunk from its roots every few centuries. Then there's Pando. This grove of quaking aspens (Populus tremuloides) in Utah may appear to be 47,000 distinct trees, but a look underground reveals the aspens are a single organism with a root system that's about 14,000 years old. New saplings that sprout from Pando's root system are genetically identical to the others, meaning even as single trees die, the organism lives on.

However, these ancient trees are relative infants compared with a meadow of Neptune grass (Posidonia oceanica) off the coast of Spain. An analysis of the sea grass' DNA and growth rate revealed the patch to be between 80,000 and 200,000 years old. It grows similarly to Pando, via rhizomes that send up genetically identical shoots.

Woods also regales readers with mythological tales. According to one Greek myth, dragon trees (Dracaena sp.) sprouted from the blood of the hundred-headed dragon slain by Hercules. Two species, D. cinnabari and D. draco, ooze blood-red sap, something so unusual and astounding that "it could only be ascribed to myth," Woods writes.

The oldest known dragon tree, growing in the Canary Islands, is estimated to be as old as 1,000 years. But it's difficult to nail down precise ages for these trees because the trunk interior is spongy and thus doesn't have growth rings. For many proposed ancient plants, a lack of growth rings keeps scientists from precisely measuring their age. And when it comes to trees with growth rings, a rotten core can muddle age analysis because the oldest growth rings are missing.

Though sometimes repetitive, Woods' cheeky prose and rich visuals make In Botanical Time an easy and engaging read for plant lovers and superlative seekers. At a time when longevity and wellness are trending topics, this book is a reminder that perhaps the best thing to do is live life a little slower.


Buy In Botanical Time from Bookshop.org. Science News is a Bookshop.org affiliate and will earn a commission on purchases made from links in this article.


Top 5 Self-Hosting Platform Alternatives to Vercel, Heroku & Netlify



Top 5 Self-Hosting Platform Alternatives to Vercel, Heroku & Netlify
Image by Author

 

Introduction

 
I've been vibe coding my Stable Coin Payment platform, running everything locally with my own server setup using Docker Compose.

But at some point, I realized something important: there really isn't a simple self-hosted platform that can handle scaling, deployment, and multi-service Docker management without turning into a full-time DevOps job.

This pushed me to start looking for Vercel-style alternatives that are easy to use while still giving me the freedom and control I want.

The self-hosting platforms I'm going to share come directly from my own experience and the struggles of searching for tools that actually work for vibe coders.

If you want better pricing, more control, strong security, and real scalability, these platforms can help you take your side project and turn it into something that feels much closer to a real startup.

The best part is that getting started doesn't require anything complicated. All you really need is an affordable Hetzner server. Install one of these platforms, many of which are designed to simplify deployments so you can focus on building instead of managing infrastructure, and you will be ready to deploy production-ready applications with confidence.

 

1. Dokploy

 
Dokploy is a stable, easy-to-use deployment solution designed to simplify application management. It serves as a free, self-hostable alternative to platforms like Heroku, Vercel, and Netlify, while leveraging the power of Docker and the flexibility of Traefik to make deployments smooth and efficient.

Key features:

  • Simplicity: Easy setup and intuitive management of deployments.
  • Flexibility: Supports a wide range of applications and databases.
  • Open Source: Completely free and open-source for anyone to use.

 

2. Coolify

 
Coolify is an open-source, self-hostable PaaS that lets you deploy applications, databases, and services, such as WordPress, Ghost, and Plausible Analytics, on your own infrastructure with ease.

It acts as a DIY alternative to platforms like Heroku, Vercel, and Netlify, enabling you to run static sites, full-stack apps, and one-click services across any server using simple, automated tooling.

Key features:

  1. Deploy Anywhere: Supports deployment to any server, including VPS, Raspberry Pi, EC2, Hetzner, and more via SSH, giving full flexibility over infrastructure.
  2. Wide Technology Support: Works with almost any language or framework, enabling deployment of static sites, APIs, backends, databases, and many popular app stacks like Next.js, Nuxt.js, and SvelteKit.
  3. Built-in Git & Automation: Offers push-to-deploy with GitHub, GitLab, Bitbucket, and Gitea, plus automatic SSL, server setup automation, and pull request deployments for smooth CI/CD workflows.

 

3. Appwrite

 
Appwrite is an open-source backend-as-a-service platform that now offers full-stack capabilities thanks to its Sites feature, which lets you deploy websites directly alongside your backend services.

Since full-stack development means handling both frontend and backend components, and Appwrite now supports website hosting plus APIs, auth, databases, storage, messaging, and functions, it provides everything needed to build, deploy, and scale full applications within a single platform.

Key features:

  1. End-to-End Full-Stack Platform: With Sites for frontend hosting and robust backend tools like Auth, Databases, Functions, Storage, Messaging, and Realtime, Appwrite covers the entire web stack.
  2. Flexible Integration Methods: Supports SDKs, REST, GraphQL, and Realtime APIs, allowing seamless integration from any language or framework.
  3. Data Ownership & Easy Migration: Offers migration tools from Firebase, Supabase, Nhost, and self-hosted setups so developers can easily move projects while keeping full control of their data.

 

4. Dokku

 
Dokku is an extensible, open-source Platform-as-a-Service that runs on a single server of your choice, functioning much like a self-hosted mini-Heroku. It builds applications automatically from a simple git push using either Dockerfiles or language autodetection via Buildpacks, then runs them inside isolated containers.

Dokku also integrates technologies like nginx and cron to route web traffic and manage background processes, giving developers a lightweight but powerful way to deploy and operate apps on their own infrastructure.

Key features:

  1. Git-Powered Deployments: Push code via Git to build apps on the fly using Dockerfiles or Buildpacks, similar to Heroku's workflow.
  2. Lightweight Single-Server PaaS: Runs on any Ubuntu/Debian server and uses Docker to manage app lifecycles, making it easy to self-host a Heroku-like environment on minimal hardware.
  3. Extensible & Plugin-Friendly: Supports a wide ecosystem of community and official plugins, allowing developers to add databases, storage, monitoring, and more to their deployments.

 

5. Juno

 
Juno is an open-source serverless platform that lets you build, deploy, and run applications in secure WASM containers while maintaining full self-hosting control and zero DevOps. It provides a complete backend stack, including key-value data storage, authentication, file storage, analytics, and serverless functions, so developers can create modern apps without managing infrastructure.

Juno also supports hosting static sites, building full web apps, and running functions with the privacy and sovereignty of self-hosting, all while offering a familiar, cloud-like developer experience.

Key features:

  1. Full Serverless Stack with Self-Hosting Control: Includes datastore, storage, auth, analytics, and serverless functions running in secure WASM containers, giving you full ownership of your apps and data.
  2. Zero-Setup Developer Experience: Use local emulation for development and deploy to isolated containers ("Satellites") with no DevOps required and a workflow similar to modern cloud platforms.
  3. Built for Web Developers: Use your favorite frontend frameworks and write serverless functions in Rust or TypeScript, with templates and tools that simplify building full-stack apps.

 

Comparison Table

 
This comparison table highlights what each platform is best for, how you deploy to it, and the kinds of applications it can run, so you can quickly pick the right self-hosted alternative for your workflow.

 

| Platform | Best for | Deploy workflow | What it runs |
| --- | --- | --- | --- |
| Dokploy | Simple "Heroku-style" self-hosting with strong Docker Compose support | UI-driven deploys + Docker Compose | Containers, Compose apps |
| Coolify | Closest feel to a self-hosted Vercel/Netlify, plus lots of prebuilt services | Git push to deploy (GitHub/GitLab/Bitbucket/Gitea) + automation | Static sites, full-stack apps, services |
| Appwrite (with Sites) | One platform for backend (Auth/DB/Storage/Functions) plus frontend hosting | Connect a Git repo or use templates for Sites | Frontends + backend services |
| Dokku | Lightweight "mini-Heroku" on a single server | git push deploys via Buildpacks or Dockerfile | Containerized apps |
| Juno | Serverless-style apps with self-hosting control and minimal ops | CLI or GitHub Actions deploy to "Satellites" | Static sites, web apps, WASM-based serverless functions |

 
 

Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.

How Machine Learning and Semantic Embeddings Reorder CVE Vulnerabilities Beyond Raw CVSS Scores


# Note: CVEDataFetcher, VulnerabilityFeatureExtractor, VulnerabilityPrioritizer,
# and VulnerabilityAnalyzer are defined earlier in the article.
import matplotlib.pyplot as plt
import numpy as np


def visualize_results(df, priority_scores, feature_importance):
    # Build a 2x3 dashboard summarizing the ML-based vulnerability analysis.
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))
    fig.suptitle('Vulnerability Scanner - ML Analysis Dashboard', fontsize=16, fontweight="bold")
    # Distribution of ML priority scores, with the 75th percentile marked.
    axes[0, 0].hist(priority_scores, bins=30, color="red", alpha=0.7, edgecolor="black")
    axes[0, 0].set_xlabel('Priority Score')
    axes[0, 0].set_ylabel('Frequency')
    axes[0, 0].set_title('Priority Score Distribution')
    axes[0, 0].axvline(np.percentile(priority_scores, 75), color="orange", linestyle="--", label="75th percentile")
    axes[0, 0].legend()
    # Raw CVSS score versus the ML priority score.
    axes[0, 1].scatter(df['cvss_score'], priority_scores, alpha=0.6, c=priority_scores, cmap='RdYlGn_r', s=50)
    axes[0, 1].set_xlabel('CVSS Score')
    axes[0, 1].set_ylabel('ML Priority Score')
    axes[0, 1].set_title('CVSS vs ML Priority')
    axes[0, 1].plot([0, 10], [0, 1], 'k--', alpha=0.3)
    # Count of vulnerabilities per severity level.
    severity_counts = df['severity'].value_counts()
    colors = {'CRITICAL': 'darkred', 'HIGH': 'red', 'MEDIUM': 'orange', 'LOW': 'yellow'}
    axes[0, 2].bar(severity_counts.index, severity_counts.values, color=[colors.get(s, 'gray') for s in severity_counts.index])
    axes[0, 2].set_xlabel('Severity')
    axes[0, 2].set_ylabel('Count')
    axes[0, 2].set_title('Severity Distribution')
    axes[0, 2].tick_params(axis="x", rotation=45)
    # Top 10 most important features from the trained model.
    top_features = feature_importance.head(10)
    axes[1, 0].barh(top_features['feature'], top_features['importance'], color="steelblue")
    axes[1, 0].set_xlabel('Importance')
    axes[1, 0].set_title('Top 10 Feature Importance')
    axes[1, 0].invert_yaxis()
    # Cluster sizes, if clustering was performed.
    if 'cluster' in df.columns:
        cluster_counts = df['cluster'].value_counts().sort_index()
        axes[1, 1].bar(cluster_counts.index, cluster_counts.values, color="teal", alpha=0.7)
        axes[1, 1].set_xlabel('Cluster')
        axes[1, 1].set_ylabel('Count')
        axes[1, 1].set_title('Vulnerability Clusters')
    # Breakdown of attack vectors.
    attack_vector_counts = df['attack_vector'].value_counts()
    axes[1, 2].pie(attack_vector_counts.values, labels=attack_vector_counts.index, autopct="%1.1f%%", startangle=90)
    axes[1, 2].set_title('Attack Vector Distribution')
    plt.tight_layout()
    plt.show()


def main():
    print("=" * 70)
    print("AI-ASSISTED VULNERABILITY SCANNER WITH ML PRIORITIZATION")
    print("=" * 70)
    print()
    # Fetch recent CVEs and report a quick overview of the dataset.
    fetcher = CVEDataFetcher()
    df = fetcher.fetch_recent_cves(days=30, max_results=50)
    print("Dataset Overview:")
    print(f"  Total CVEs: {len(df)}")
    print(f"  Date Range: {df['published'].min()[:10]} to {df['published'].max()[:10]}")
    print(f"  Severity Breakdown: {df['severity'].value_counts().to_dict()}")
    print()
    # Extract semantic embeddings plus keyword and categorical features.
    feature_extractor = VulnerabilityFeatureExtractor()
    embeddings = feature_extractor.extract_semantic_features(df['description'].tolist())
    df = feature_extractor.extract_keyword_features(df)
    df = feature_extractor.encode_categorical_features(df)
    # Train the prioritization models on severity labels and CVSS scores.
    prioritizer = VulnerabilityPrioritizer()
    X = prioritizer.prepare_features(df, embeddings)
    severity_map = {'LOW': 0, 'MEDIUM': 1, 'HIGH': 2, 'CRITICAL': 3, 'UNKNOWN': 1}
    y_severity = df['severity'].map(severity_map).values
    y_score = df['cvss_score'].values
    X_scaled = prioritizer.train_models(X, y_severity, y_score)
    priority_scores, severity_probs, score_preds = prioritizer.predict_priority(X)
    df['ml_priority_score'] = priority_scores
    df['predicted_score'] = score_preds
    # Cluster vulnerabilities in embedding space.
    analyzer = VulnerabilityAnalyzer(n_clusters=5)
    clusters = analyzer.cluster_vulnerabilities(embeddings)
    df = analyzer.analyze_clusters(df, clusters)
    feature_imp, emb_imp = prioritizer.get_feature_importance()
    print("\n--- Feature Importance ---")
    print(feature_imp.head(10))
    print(f"\nAverage embedding importance: {emb_imp:.4f}")
    print("\n" + "=" * 70)
    print("TOP 10 PRIORITY VULNERABILITIES")
    print("=" * 70)
    top_vulns = df.nlargest(10, 'ml_priority_score')[['cve_id', 'cvss_score', 'ml_priority_score', 'severity', 'description']]
    for idx, row in top_vulns.iterrows():
        print(f"\n{row['cve_id']} [Priority: {row['ml_priority_score']:.3f}]")
        print(f"  CVSS: {row['cvss_score']:.1f} | Severity: {row['severity']}")
        print(f"  {row['description'][:100]}...")
    print("\n\nGenerating visualizations...")
    visualize_results(df, priority_scores, feature_imp)
    print("\n" + "=" * 70)
    print("ANALYSIS COMPLETE")
    print("=" * 70)
    print("\nResults summary:")
    print(f"  High Priority (>0.7): {(priority_scores > 0.7).sum()} vulnerabilities")
    print(f"  Medium Priority (0.4-0.7): {((priority_scores >= 0.4) & (priority_scores <= 0.7)).sum()}")
    print(f"  Low Priority (<0.4): {(priority_scores < 0.4).sum()}")
    return df, prioritizer, analyzer


if __name__ == "__main__":
    results_df, prioritizer, analyzer = main()
    print("\n✓ All analyses completed successfully!")
    print("\nYou can now:")
    print("  - Access results via the 'results_df' DataFrame")
    print("  - Use 'prioritizer' to score new vulnerabilities")
    print("  - Explore 'analyzer' for clustering insights")

Back from the dead, a black hole is erupting after a 100-million-year hiatus




Radio images captured this "cosmic volcano" being reborn at the heart of the galaxy J1007+3540


After 100 million years of dormancy, the supermassive black hole at the heart of galaxy J1007+3540 is glowing bright.

LOFAR/Pan-STARRS/S. Kumari et al.

Inside an extremely bright cluster of galaxies, a long-dormant supermassive black hole has come back to life. Radio images captured a one-million-light-year-long stream of star-forming debris and gas emanating from the black hole at the heart of the galaxy J1007+3540, which apparently is erupting for the first time in about 100 million years.

"Although some 'restarted' radio galaxies are known in the literature, J1007+3540 stands out," says lead study author Shobha Kumari of Midnapore City College in India. The result recently appeared in the Monthly Notices of the Royal Astronomical Society.

J1007+3540 is an uncommonly large example of an episodic galaxy, in which a central supermassive black hole only intermittently emits prominent jets of debris and gas, almost as if an astrophysical on-off switch had been flipped. Researchers say the information they gather from the eruption of this "cosmic volcano" could help them better understand episodic galaxies' structures, evolution and influence on their surroundings.




Ejected jets are a consistent but not ubiquitous feature of the supermassive black holes at the hearts of galaxies, which, when erupting, are also referred to as active galactic nuclei (AGNs). Many AGNs are thought to be episodic, ebbing as they exhaust surrounding reservoirs of gas, only to surge again when more material drifts within reach. This cycle elapses across thousands of years, glacially slow to us but almost instantaneous on cosmic scales.

That makes episodic activity and the on-off transition difficult to catch as it occurs. Rather than trying to observe the changes themselves, scientists often analyze the structures within galaxies that they think arise from a central black hole's episodic outbursts. If the black hole is dormant, they look for echoes of its past active phase, such as high-energy light or ionized gas that has traveled farther out from the galaxy's center. And, of course, if a galaxy's central black hole is in its AGN phase, like J1007+3540's, the evidence is plain to see.

The radio images of J1007+3540, taken using interferometers at the Low Frequency Array in the Netherlands and the upgraded Giant Metrewave Radio Telescope in India, capture both phases in a single target. The galaxy sports not only a bright newborn jet but also a surrounding surfeit of older material blasted out by past AGN episodes. While other episodic galaxies are expected to have similar structures, J1007+3540's are especially clear.

“This system is just physically very large, and that makes it more amenable to study in many ways,” explains Niel Brandt, an astrophysicist at Pennsylvania State University. “You can go in and study it in considerable detail.”

One of these details, a faint, fragmented tail of old material extending out into intergalactic space, stirred by subsequent outbursts to shine anew, shows how J1007+3540's AGN phase can affect its cosmic neighborhood: specifically, the gas pervading the galaxy cluster where J1007+3540 resides, known as the intracluster medium (ICM). The shape and brightness of the rekindled tail trace the complex interactions that occurred between the AGN's ejected jet and the ICM as the jet propagated outward.

“These observations help us understand that the relationship between a galaxy's jets and the cluster environment is very dynamic,” says Vivian U, an astronomer at the University of California, Irvine. “The jets don't just carve a path through empty space; they're constantly shaped and altered by the gas they encounter.”

There's still a lot left to learn about how interactions with the ICM can feed back to change the form and behavior of a galaxy's jets, all of which can spark (or suppress) the creation of new generations of stars. Somehow the sparkle and flutter of AGNs at the hearts of galaxies may dictate whether they shine for eons or fade to starless black.

“The oddballs are exciting,” says Phil Hopkins, a theoretical astrophysicist at the California Institute of Technology. Observing unusual cases like J1007+3540 gives researchers the chance to test and improve their models of how this majestic process unfolds.


Flexible discrete choice modeling using a multinomial probit model, part 1



\(\newcommand{\xb}{{\bf x}}
\newcommand{\betab}{\boldsymbol{\beta}}
\newcommand{\zb}{{\bf z}}
\newcommand{\gammab}{\boldsymbol{\gamma}}\)We have no choice but to choose

We make choices every day, and often these choices are made among a finite number of possible alternatives. For example, do we take the car or ride a bike to get to work? Do we have dinner at home or eat out, and if we eat out, where do we go? Scientists, marketing analysts, or political consultants, to name a few, would like to find out why people choose what they choose.

In this post, I provide some background on discrete choice models, specifically the multinomial probit model. I discuss this model from a random utility model perspective and show you how to simulate data from it. This is helpful for understanding the underpinnings of this model. In my next post, we will use the simulated data to demonstrate how to estimate and interpret effects of interest.

Random utility model and discrete choice

A person faced with a discrete set of alternatives is assumed to choose the alternative that maximizes his or her utility in some defined way. Utilities are typically conceived of as the result of a function that consists of an observed deterministic part and an unobserved random part, because not all factors that may be relevant for a given decision can be observed. The frequently used linear random utility model is

\[U_{ij} = V_{ij} + \epsilon_{ij}, \hspace{5mm} j = 1,\ldots,J\]

where \(U_{ij}\) is the utility of the \(i\)th individual related to the \(j\)th alternative, \(V_{ij}\) is the observed component, and \(\epsilon_{ij}\) is the unobserved component. In the context of regression modeling, the observed part, \(V_{ij}\), is usually construed as some linear or nonlinear combination of observed characteristics related to individuals and alternatives and corresponding parameter estimates, while the parameters are estimated based on a model that makes certain assumptions about the distribution of the unobserved components, \(\epsilon_{ij}\).

Motivating example

Let's take a look at an example. Suppose that individuals can enroll in one of three health insurance plans: Sickmaster, Allgood, and Cowboy Health. Thus we have the following set of alternatives:

\[s=\{\mathrm{Sickmaster},\mathrm{Allgood},\mathrm{Cowboy\, Health}\}\]

We would expect a person's utility related to each of the three alternatives to be a function of both personal characteristics (such as income or age) and characteristics of the health care plan (such as its price).

We might sample individuals and ask them which health plan they would prefer if they had to enroll in one of them. If we collected data on the person's age (in decades), the person's household income (in $10,000), and the price of a plan (in $100/month), then our data might look something like the first three cases from the simulated data below:


. list in 1/9, sepby(id)

     +-----------------------------------------------------------+
     |  id             alt   choice   hhinc   age   price       U |
     |-----------------------------------------------------------|
  1. |   1      Sickmaster        1    3.66   2.1    2.05    2.38 |
  2. |   1         Allgood        0    3.66   2.1    1.73   -1.04 |
  3. |   1   Cowboy Health        0    3.66   2.1    1.07   -2.61 |
     |-----------------------------------------------------------|
  4. |   2      Sickmaster        0    3.75   4.2    2.19   -2.97 |
  5. |   2         Allgood        1    3.75   4.2    1.12    0.29 |
  6. |   2   Cowboy Health        0    3.75   4.2    0.78   -2.22 |
     |-----------------------------------------------------------|
  7. |   3      Sickmaster        0    2.32   2.4    2.25   -4.49 |
  8. |   3         Allgood        0    2.32   2.4    1.31   -5.76 |
  9. |   3   Cowboy Health        1    2.32   2.4    1.02    1.19 |
     +-----------------------------------------------------------+

Taking the first case (id==1), we see that the case-specific variables hhinc and age are constant across alternatives and that the alternative-specific variable price varies over alternatives.

The variable alt labels the alternatives, and the binary variable choice indicates the chosen alternative (it is coded 1 for the chosen plan, and 0 otherwise). Because this is a simulated dataset, we know the underlying utilities that correspond to each alternative, and those are given in the variable U. The first respondent's utility is highest for the first alternative, and so the outcome variable choice takes the value 1 for alt=="Sickmaster" and 0 otherwise. This is the marginal distribution of cases over alternatives:


. tabulate alt if choice == 1

    Insurance |
         plan |      Freq.     Percent        Cum.
--------------+-----------------------------------
   Sickmaster |      6,315       31.57       31.57
      Allgood |      8,308       41.54       73.11
Cowboy Health |      5,377       26.89      100.00
--------------+-----------------------------------
        Total |     20,000      100.00

As we will see below, a useful model for analyzing these kinds of data is the multinomial probit model.

Multinomial probit model

The multinomial probit model is a discrete choice model that is based on the assumption that the unobserved components \(\epsilon_{ij}\) come from a normal distribution. Different probit models arise from different specifications of \(V_{ij}\) and different assumptions about \(\epsilon_{ij}\). For example, with a basic multinomial probit model, as is implemented in Stata's mprobit command (see [R] mprobit), we specify \(V_{ij}\) to be

\[V_{ij} = \xb_{i}\betab_{j}^{\,\prime}\]

where \(\xb_{i}\) is a vector of individual-specific covariates, and \(\betab_{j}\) is the corresponding parameter vector for alternative \(j\). The random components \(\epsilon_{ij}\) are assumed to come from a multivariate normal distribution with mean zero and identity variance–covariance matrix. For example, if we had three alternatives, we would assume

\begin{equation*}
\epsilon_{ij} \sim \mathcal{N}(0,\Sigma) , \hspace{5mm}
\Sigma =
\begin{bmatrix}
1 & 0 & 0 \\
  & 1 & 0 \\
  &   & 1
\end{bmatrix}
\end{equation*}

Specifying the above covariance structure means that the unobserved components, \(\epsilon_{ij}\), are assumed to be homoskedastic and independent across alternatives.

Independence implies that differences in utility between any two alternatives depend on those two alternatives but not on any of the other alternatives. This property is known as the independence from irrelevant alternatives (IIA) assumption. When the IIA assumption holds, it can lead to a number of convenient advantages, such as studying only a subset of alternatives (see Train [2009, 48]). However, IIA is a rather restrictive assumption that might not hold.
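To make this concrete, consider the difference in utilities between two alternatives \(j\) and \(k\):

\[U_{ij} - U_{ik} = (V_{ij} - V_{ik}) + (\epsilon_{ij} - \epsilon_{ik})\]

With the identity covariance matrix above, the error difference \(\epsilon_{ij} - \epsilon_{ik}\) involves only the two alternatives being compared, so the comparison is unaffected by any third alternative; once \(\Sigma\) allows correlation across alternatives, this is no longer the case.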

Continuing with our health care plan example, suppose that Sickmaster and Allgood both favor people with health problems, while Cowboy Health favors people who only rarely see a doctor. In this case, we would expect the utilities that correspond to alternatives Sickmaster and Allgood to be positively correlated while being negatively correlated with the utility corresponding to Cowboy Health. In other words, utilities with respect to alternatives Sickmaster and Allgood are related to those of Cowboy Health. In this case, we need to use a model that relaxes the IIA assumption and allows for correlated utilities across alternatives.

Another potential limitation of our multinomial probit specification concerns the observed \(V_{ij}\), which consists of the linear combination of individual-specific variables and alternative-specific parameters. In other words, we only consider observed variables that vary over individuals but not over alternatives. In a setting like this, we would use

\[V_{ij} = \xb_{i}\betab_{j}' + \zb_{ij}\gammab'\]

where \(\zb_{ij}\) are alternative-specific variables that vary both over individuals and alternatives and \(\gammab\) is the corresponding parameter vector. Combining this with our more flexible assumptions about the unobservables, we can write our model as

\[U_{ij} = \xb_{i}\betab_{j}' + \zb_{ij}\gammab' + \epsilon_{ij}, \hspace{5mm} j = 1,\ldots,J\]

with \(\epsilon_{ij} \sim \mathcal{N}(0,\Sigma)\).

Assuming unstructured correlation and heteroskedastic errors across \(J=3\) alternatives, for example, \(\Sigma\) is given by

\begin{equation*}
\Sigma =
\begin{bmatrix}
\sigma_{11} & \sigma_{12} & \sigma_{13} \\
            & \sigma_{22} & \sigma_{23} \\
            &             & \sigma_{33}
\end{bmatrix}
\end{equation*}

As we will see later, we can fit this model in Stata with the asmprobit command; see [R] asmprobit for details about the command and the implemented methods.

We said in our health plan example that we think the price that individual \(i\) has to pay for the plan is important and that it may vary both over individuals and alternatives. We can therefore write our utility model for three alternatives as

\[ U_{ij} = \beta_{j,\mathtt{cons}} + \beta_{j,\mathtt{hhinc}}{\tt hhinc}_{i} + \beta_{j,\mathtt{age}}{\tt age}_{i} + \gamma\, {\tt price}_{ij} + \epsilon_{ij}, \hspace{5mm} j = 1,2,3 \]

Simulation

We can simulate data assuming the data-generating process given in the above model. We will specify the two case-specific variables, household income (hhinc) and age (age), and we will take the price of the plan (price) as the alternative-specific variable. The case-specific variables hhinc and age will be constant across alternatives within each individual, while the alternative-specific variable price will vary over individuals and, within individuals, over alternatives.

We specify the following population parameters for \(\boldsymbol{\beta}_{j}\) and \(\gamma\):

\begin{align*}
\beta_{1,\mathtt{cons}} &= -1, &\beta_{1,\mathtt{hhinc}} &= \hspace{2.7mm} 1, &\beta_{1,\mathtt{age}} &= -1 \\
\beta_{2,\mathtt{cons}} &= -6, &\beta_{2,\mathtt{hhinc}} &= \hspace{2.7mm} 0.5, &\beta_{2,\mathtt{age}} &= \hspace{2.7mm} 1 \\
\beta_{3,\mathtt{cons}} &= \hspace{2.7mm} 2, &\beta_{3,\mathtt{hhinc}} &= -1, &\beta_{3,\mathtt{age}} &= \hspace{2.7mm} 0.5 \\
\gamma &= -0.5
\end{align*}

For \(\epsilon_{ij}\), we will specify the following:

\begin{equation*}
\epsilon_{ij} \sim \mathcal{N}(0,\Sigma), \hspace{5mm}
\Sigma =
\begin{bmatrix}
2.1 & 0.6 & -0.5 \\
 & 1.7 & -0.8 \\
 & & 1.4
\end{bmatrix}
\end{equation*}

With these specifications, we can now create a simulated dataset. We start by drawing our three error terms and two case-specific covariates:


. clear

. set seed 65482

. set obs 20000
number of observations (_N) was 0, now 20,000

. generate id = _n

. scalar s11 =  2.1

. scalar s22 =  1.7

. scalar s33 =  1.4

. scalar s12 =  0.6

. scalar s13 = -0.5

. scalar s23 = -0.8

. mat C = (s11,s12,s13) \
>         (s12,s22,s23) \
>         (s13,s23,s33)

. drawnorm e1 e2 e3, cov(C)

. generate double hhinc = max(0,rnormal(5,1.5))

. generate double age = runiformint(20,60)/10

To allow for alternative-specific covariates, we will expand our data so that we have one observation for each alternative for each case, then create an index for the alternatives, and then generate our variables \({\tt price}_{ij}\):


. expand 3
(40,000 observations created)

. bysort id : gen alt = _n

. generate double price = rbeta(2,2) + 1.50 if alt == 1
(40,000 missing values generated)

. replace         price = rbeta(2,2) + 0.75 if alt == 2
(20,000 real changes made)

. replace         price = rbeta(2,2) + 0.25 if alt == 3
(20,000 real changes made)

We can now go ahead and generate three variables for the observed utility components, one for each alternative:


. generate double xb1 = -1.0 + 1.0*hhinc - 1.0*age - 0.5*price

. generate double xb2 = -6.0 + 0.5*hhinc + 1.0*age - 0.5*price

. generate double xb3 =  2.0 - 1.0*hhinc + 0.5*age - 0.5*price

To calculate the utilities that correspond to each alternative, we add the unobserved to the observed components:


. local snorm = sqrt((s11 + s22 - 2*s12)/2)

. generate double U1 = xb1*`snorm' + e1

. generate double U2 = xb2*`snorm' + e2

. generate double U3 = xb3*`snorm' + e3

Looking at the code above, you will notice that we included a factor to scale our specified population parameters. This is due to identification details related to our model that I explain further in the Identification section. One thing we need to know now, however, is that for the model to be identified, the utilities must be normalized for level and scale. Normalizing for level is easy because, since we are only interested in the utilities relative to one another, we can define a base alternative and then take the differences of utilities with respect to that base. If we set the first alternative as the base, we can rewrite our model as follows:

\begin{align*}
U^{*}_{ij} &= \beta_{j,\mathtt{cons}}-\beta_{1,\mathtt{cons}} + (\beta_{j,\mathtt{hhinc}}-\beta_{1,\mathtt{hhinc}}){\tt hhinc}_{i} +
(\beta_{j,\mathtt{age}}-\beta_{1,\mathtt{age}}){\tt age}_{i} \\
&\quad +
\gamma ({\tt price}_{ij}-{\tt price}_{i1}) + \epsilon_{ij}-\epsilon_{i1}, \hspace{5mm} j = 2,3
\end{align*}

This implies that only \(J-1\) parameter vectors in \(\boldsymbol{\beta}\) are identified. Let's define these parameters as

\begin{align*}
\Delta \beta_{j,\mathtt{cons}} &= \beta_{j,\mathtt{cons}}-\beta_{1,\mathtt{cons}} \\
\Delta \beta_{j,\mathtt{hhinc}} &= \beta_{j,\mathtt{hhinc}}-\beta_{1,\mathtt{hhinc}} \\
\Delta \beta_{j,\mathtt{age}} &= \beta_{j,\mathtt{age}}-\beta_{1,\mathtt{age}}
\end{align*}

for \(j = 2,3\). The parameters in \(\boldsymbol{\beta}_{j}\) that we will try to recover will then be the following differences:

\begin{align*}
\Delta \beta_{2,\mathtt{cons}} &= -5 \\
\Delta \beta_{3,\mathtt{cons}} &= \hspace{2.7mm} 3 \\
\Delta \beta_{2,\mathtt{hhinc}} &= -0.5 \\
\Delta \beta_{3,\mathtt{hhinc}} &= -2 \\
\Delta \beta_{2,\mathtt{age}} &= \hspace{2.7mm} 2 \\
\Delta \beta_{3,\mathtt{age}} &= \hspace{2.7mm} 1.5
\end{align*}

What is left to complete our simulated dataset is to generate the outcome variable that takes the value 1 if observation \(i\) chooses alternative \(k\), and 0 otherwise. To do this, we will first create a single variable for the utilities and then determine the alternative with the highest utility:


. quietly generate double U = .

. quietly generate y = .

. forval i = 1/3 {
  2.     quietly replace U = U`i' if alt==`i'
  3. }

. bysort id : egen double umax_i = max(U)

. forval i = 1/3 {
  2.     quietly bysort id : replace y = alt if umax_i == U
  3. }

. generate choice = alt == y

We obtain the following by using asmprobit:


. asmprobit choice price, case(id) alternatives(alt) casevars(hhinc age) 
> basealternative(1) scalealternative(2) nolog

Alternative-specific multinomial probit         Number of obs      =     60,000
Case variable: id                               Number of cases    =     20,000

Alternative variable: alt                       Alts per case: min =          3
                                                               avg =        3.0
                                                               max =          3
Integration sequence:      Hammersley
Integration points:               150           Wald chi2(5)    =    4577.15
Log simulated-likelihood = -11219.181           Prob > chi2     =     0.0000

----------------------------------------------------------------------------
      choice |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
-------------+--------------------------------------------------------------
alt          |
       price | -.4896106   .0523626    -9.35   0.000   -.5922394   -.3869818
-------------+--------------------------------------------------------------
Sickmaster   | (base alternative)
-------------+--------------------------------------------------------------
Allgood      |
       hhinc | -.5006212   .0302981   -16.52   0.000   -.5600043    -.441238
         age |  2.001367   .0306663    65.26   0.000    1.941262    2.061472
       _cons | -4.980841   .1968765   -25.30   0.000   -5.366711    -4.59497
-------------+--------------------------------------------------------------
Cowboy_Hea~h |
       hhinc | -1.991202   .1092118   -18.23   0.000   -2.205253    -1.77715
         age |  1.494056   .0446662    33.45   0.000    1.406512    1.581601
       _cons |  3.038869   .4066901     7.47   0.000    2.241771    3.835967
-------------+--------------------------------------------------------------
     /lnl2_2 |  .5550228   .0742726     7.47   0.000    .4094512    .7005944
-------------+--------------------------------------------------------------
       /l2_1 |   .667308   .1175286     5.68   0.000    .4369562    .8976598
----------------------------------------------------------------------------
(alt=Sickmaster is the alternative normalizing location)
(alt=Allgood is the alternative normalizing scale)

Looking at the above output, we see that the coefficient of the alternative-specific variable price is \(\widehat{\gamma} = -0.49\), which is close to our specified population parameter of \(\gamma = -0.50\). We can say the same about our case-specific variables. The estimated coefficients of hhinc are \(\widehat{\Delta\beta}_{2,\mathtt{hhinc}} = -0.50\) for the second and \(\widehat{\Delta\beta}_{3,\mathtt{hhinc}} = -1.99\) for the third alternative. The estimates for age are \(\widehat{\Delta\beta}_{2,\mathtt{age}} = 2.00\) and \(\widehat{\Delta\beta}_{3,\mathtt{age}} = 1.49\). The estimated differences in alternative-specific constants are \(\widehat{\Delta\beta}_{2,\mathtt{cons}} = -4.98\) and \(\widehat{\Delta\beta}_{3,\mathtt{cons}} = 3.04\).

Identification

Now let me shed more light on the identification details related to our model that we needed to consider when we simulated our dataset. An important feature of \(U_{ij}\) is that the level as well as the scale of utility is irrelevant with respect to the chosen alternative: shifting the level by some constant amount, or multiplying it by a (positive) constant, does not change the rank order of utilities and thus has no impact on the chosen alternative. This has important ramifications for modeling utilities because, without a fixed level and scale of \(U_{ij}\), there are an infinite number of parameters in \(V_{ij}\) that yield the same result in terms of the chosen alternatives. Therefore, utilities must be normalized to identify the parameters of the model.

We already saw how to normalize for level. Normalizing for scale is a bit more difficult, though, because we assume correlated and heteroskedastic errors. Because of the heteroskedasticity, we need to set the scale for one of the variances and then estimate the other variances relative to the one we set. We must also account for the nonzero covariance between the errors, which makes additional identifying restrictions necessary. It turns out that, given our model assumptions, only \(J(J-1)/2-1\) parameters of our variance–covariance matrix are identifiable (see chapter 5 in Train [2009] for details about identifying restrictions in the context of probit models). To be concrete, our original variance–covariance matrix was the following:

\begin{equation*}
\Sigma =
\begin{bmatrix}
\sigma_{11} & \sigma_{12} & \sigma_{13} \\
 & \sigma_{22} & \sigma_{23} \\
 & & \sigma_{33}
\end{bmatrix}
\end{equation*}

Taking differences of correlated errors reduces the \(3 \times 3\) matrix of error variances to a \(2 \times 2\) variance–covariance matrix of error differences:

\begin{equation*}
\Sigma^{*} =
\begin{bmatrix}
\sigma_{11}+\sigma_{22}-2\sigma_{12} & \sigma_{11}+\sigma_{23}-\sigma_{12}-\sigma_{13} \\
 & \sigma_{11}+\sigma_{33}-2\sigma_{13}
\end{bmatrix}
\end{equation*}

If we normalize this matrix with respect to the second alternative, we get

\begin{equation*}
\widetilde \Sigma^{*} =
\begin{bmatrix}
1 & (\sigma_{11}+\sigma_{23}-\sigma_{12}-\sigma_{13})/\nu \\
 & (\sigma_{11}+\sigma_{33}-2\sigma_{13})/\nu
\end{bmatrix}
\end{equation*}

where \(\nu = \sigma_{11}+\sigma_{22}-2\sigma_{12}\). Because we also want to set the scale for our base alternative, our normalized matrix becomes

\begin{equation*}
\check \Sigma^{*} =
\begin{bmatrix}
2 & 2(\sigma_{11}+\sigma_{23}-\sigma_{12}-\sigma_{13})/\nu \\
 & 2(\sigma_{11}+\sigma_{33}-2\sigma_{13})/\nu
\end{bmatrix}
\end{equation*}

Thus, because utilities are scaled by the standard deviation, they are divided by \(\sqrt{\nu/2}\). Now, getting back to our simulation, if we wish to recover our specified parameters, we need to scale them accordingly. We start from the variance–covariance matrix of error differences:

\begin{equation*}
\Sigma^{*} =
\begin{bmatrix}
2.1 + 1.7 - 2(0.6) & 2.1 - 0.8 - 0.6 + 0.5 \\
 & 2.1 + 1.4 - 2(-0.5)
\end{bmatrix}
=
\begin{bmatrix}
2.6 & 1.2 \\
 & 4.5
\end{bmatrix}
\end{equation*}

Normalizing with respect to the second alternative yields

\begin{equation*}
\widetilde \Sigma^{*} =
\begin{bmatrix}
1 & 1.2/2.6 \\
 & 4.5/2.6
\end{bmatrix}
=
\begin{bmatrix}
1 & 0.4615 \\
 & 1.7308
\end{bmatrix}
\end{equation*}

and then multiplying \(\widetilde \Sigma^{*}\) by 2 yields

\begin{equation*}
\check \Sigma^{*} =
\begin{bmatrix}
2 & 0.9231 \\
 & 3.4615
\end{bmatrix}
\end{equation*}

which are the true variance–covariance parameters. Our scaling term is \(\sqrt{2.6/2}\), and because the utilities will be divided by this term, we need to multiply our parameters by it.

Finally, we check whether we can recover our variance–covariance parameters. We use the postestimation command estat covariance to display the estimated variance–covariance matrix of error differences:


. estat covariance

  +-------------------------------------+
  |              |   Allgood  Cowboy_~h |
  |--------------+----------------------|
  |      Allgood |         2            |
  | Cowboy_Hea~h |   .943716   3.479797 |
  +-------------------------------------+
Note: Covariances are for alternatives differenced with Sickmaster.

We see that our estimate is close to the true normalized covariance matrix.

Conclusion

I discussed multinomial probit models in a discrete choice context and showed how to generate a simulated dataset accordingly. In my next post, we will use our simulated dataset and discuss estimation and interpretation of the model results, which is not as straightforward as one might think.

Reference

Train, K. E. 2009. Discrete Choice Methods with Simulation. 2nd ed. New York: Cambridge University Press.



Air for Tomorrow: Mapping the Digital Air-Quality Landscape, from Repositories and Data Types to Starter Code



A road in Lao PDR. The school is 200 meters away. Traffic roars, smoke from burning garbage drifts across the path, and children walk straight through it. What are they breathing today? Without local data, nobody really knows.

Across East Asia and the Pacific, 325 million children [1] breathe toxic air every day, sometimes at levels 10 times above safe limits. The damage is often silent: affected lungs and asthma, and in acute cases missed school days. Futures are at stake. In the long run, health systems are strained, and economies bear the costs.

In many cases, air quality data is not even available.

No monitors. No evidence. No protection.

In this second part of the blog series [2], we look at the data repositories where useful air-quality data is available, how to import them, and how to get them up and running in your notebook. We also demystify data formats such as GeoJSON, Parquet/GeoParquet, NetCDF/HDF5, COG, GRIB, and Zarr so you can pick the right tool for the job. We are building this up so that, in the next part, we can go step by step through how we developed an open-source air quality model.

In the last few years, there has been a significant push to generate and use air-quality data. These data come from different sources, and their quality varies accordingly. A few kinds of repositories cover most needs: regulatory stations for ground truth, community sensors to understand hyperlocal variation, satellites for regional context, and model reanalyses for estimates (Figure 2). The good news: most of this is open. The better news: the code to get started is relatively short.

Figure 2: Fire hotspots as of 20.04.2024 and the interpolated density map created using multiple data sources. Source: @UNICEF. All rights reserved.

Repository quick-starts (with minimal Python) 

In this section, we move from concepts to practice. Below, we walk through a set of commonly used open repositories and show the smallest possible code you need to start pulling data from each of them. All examples assume Python ≥3.10 with pip installs as needed.

For each numbered repository, you will find:

  • a short description of what the data source is and how it is maintained,
  • typical use cases (when this source is a good fit),
  • how to access it (API keys, sign-up notes, or direct URLs), and
  • a minimal Python code snippet to extract data.

Think of this as a practical guide: skim the descriptions, pick the source that matches your problem, and then adapt the code to plug directly into your own analysis or model pipeline.

Tip: Keep secrets out of code. Use environment variables for tokens (e.g., export AIRNOW_API_KEY=…).
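As a minimal sketch of that tip (AIRNOW_API_KEY is just the example variable named above), you can read the token from the environment and fail early if it is missing:

import os

# Read a token that was exported in the shell (export AIRNOW_API_KEY=...) instead of hard-coding it.
api_key = os.getenv("AIRNOW_API_KEY")
if not api_key:
    raise RuntimeError("Set the AIRNOW_API_KEY environment variable before running this script.")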

1) OpenAQ (global surface measurements; open API)

OpenAQ [3] is an open data platform that hosts global air quality measurements, such as PM2.5, PM10, and O3. It provides air quality data by partnering with governmental partners, community partners, and air quality sensor companies such as AirGradient and IQAir, among others.

Great for: quick cross-country pulls, harmonised units/metadata, reproducible pipelines.

Sign up for an OpenAQ API key at https://explore.openaq.org. After signing up, find your API key in your settings. Use this key to authenticate requests.

!pip install openaq pandas
import pandas as pd
from pandas import json_normalize
from openaq import OpenAQ
import datetime
from datetime import timedelta
import geopandas as gpd
import requests
import time
import json

# follow the quickstart to get the api key: https://docs.openaq.org/using-the-api/quick-start
api_key = ''  # enter your API key before executing
client = OpenAQ(api_key=api_key)  # use the API key generated earlier

# get the locations of the sensors in the selected countries (country ID codes: see https://docs.openaq.org)
locations = client.locations.list(
    countries_id=[68, 111],
    limit=1000
)

data_locations = locations.dict()
df_sensors_country = json_normalize(data_locations['results'])
df_sensors_exploded = df_sensors_country.explode('sensors')
df_sensors_exploded['sensor_id'] = df_sensors_exploded['sensors'].apply(lambda x: x['id'])
df_sensors_exploded['sensor_type'] = df_sensors_exploded['sensors'].apply(lambda x: x['name'])
df_sensors_pm25 = df_sensors_exploded[df_sensors_exploded['sensor_type'] == "pm25 µg/m³"]
df_sensors_pm25

# go through each location and extract the hourly measurements
df_concat_aq_data = pd.DataFrame()
to_date = datetime.datetime.now()
from_date = to_date - timedelta(days=2)  # get the past 2 days of data
sensor_list = df_sensors_pm25.sensor_id

for sensor_id in sensor_list[0:5]:
    print("-----")
    response = client.measurements.list(
        sensors_id=sensor_id,
        datetime_from=from_date,
        datetime_to=to_date,
        limit=500)
    print(response)

    data_measurements = response.dict()
    df_hourly_data = json_normalize(data_measurements['results'])
    df_hourly_data["sensor_id"] = sensor_id
    if len(df_hourly_data) > 0:
        df_concat_aq_data = pd.concat([df_concat_aq_data, df_hourly_data])
        df_concat_aq_data = df_concat_aq_data[["sensor_id", "period.datetime_from.utc", "period.datetime_to.utc", "parameter.name", "value"]]

df_concat_aq_data

2) EPA AQS Data Mart (U.S. regulatory archive; token needed)

The EPA AQS Data Mart [4] is a U.S. regulatory data archive that hosts quality-controlled air-quality measurements from thousands of monitoring stations across the country. It provides long-term records for criteria pollutants such as PM₂.₅, PM₁₀, O₃, NO₂, SO₂, and CO, together with detailed site metadata and QA flags, and is freely accessible through an API once you register and obtain an access token. It provides meteorological data as well.

Great for: authoritative, QA/QC'd U.S. data.

Sign up for an AQS Data Mart account on the US EPA website at: https://aqs.epa.gov/aqsweb/documents/data_api.html
Create a .env file in your environment and add your credentials, including your AQS email and AQS key.

# pip install requests pandas

import os, requests, pandas as pd
AQS_EMAIL = os.getenv("AQS_EMAIL")
AQS_KEY   = os.getenv("AQS_KEY")

url = "https://aqs.epa.gov/data/api/sampleData/byState"
params = {"email": AQS_EMAIL, "key": AQS_KEY, "param": "88101", "bdate": "20250101", "edate": "20250107", "state": "06"}
r = requests.get(url, params=params, timeout=60)

df = pd.json_normalize(r.json()["Data"])
print(df[["state_name","county_name","date_local","sample_measurement","units_of_measure"]].head())

3) AirNow (U.S. real-time indices; API key)

AirNow [5] is a U.S. government platform that provides near real-time air-quality index (AQI) information based on regulatory monitoring data. It publishes current and forecast AQI values for pollutants such as PM₂.₅ and O₃, together with category breakpoints ("Good", "Moderate", etc.) that are easy to communicate to the public. Data can be accessed programmatically via the AirNow API once you register and obtain an API key.

Great for: wildfire events and public-facing AQI visuals.

Register for an AirNow API account via the AirNow API portal: https://docs.airnowapi.org/

From the Log In page, select "Request an AirNow API Account" and complete the registration form with your email and basic details. After you activate your account, you will find your API key in your AirNow API dashboard; use this key to authenticate all calls to the AirNow web services.

import os, requests, pandas as pd

API_KEY = os.getenv("AIRNOW_API_KEY")
url = "https://www.airnowapi.org/aq/observation/latLong/current/"
params = {"format": "application/json", "latitude": 37.7749, "longitude": -122.4194, "distance": 25, "API_KEY": API_KEY}
df = pd.json_normalize(requests.get(url, params=params, timeout=30).json())  # flattens nested fields such as Category.Name

print(df[["ParameterName", "AQI", "Category.Name", "DateObserved", "HourObserved"]])

4) Copernicus Atmosphere Monitoring Service (CAMS; Atmosphere Data Store)

The Copernicus Atmosphere Monitoring Service [6], implemented by ECMWF for the EU's Copernicus programme, provides global reanalyses and near-real-time forecasts of atmospheric composition. Through the Atmosphere Data Store (ADS), you can access gridded fields for aerosols, reactive gases (O₃, NO₂, etc.), greenhouse gases and related meteorological variables, with multi-year records suitable for both research and operational applications. All CAMS products in the ADS are open and free of charge, subject to accepting the Copernicus licence.

Great for: global background fields (aerosols & trace gases), forecasts and reanalyses.

How to register and get API access

  1. Go to the Atmosphere Data Store: https://ads.atmosphere.copernicus.eu
  2. Click Login / Register in the top-right corner and create a (free) Copernicus/ECMWF account.
  3. After confirming your email, log in and go to your profile page to find your ADS API key (UID + key).
  4. Follow the ADS "How to use the API" instructions to create a configuration file (typically ~/.cdsapirc) with:
     url: https://ads.atmosphere.copernicus.eu/api
     key: <your UID>:<your key>
  5. On the web page of each CAMS dataset you want to use, go to the Download data tab and accept the licence at the bottom once; only then will API requests for that dataset succeed.

Once this is set up, you can use the standard cdsapi Python client to programmatically download CAMS datasets from the ADS.

# pip install cdsapi xarray cfgrib

import cdsapi
c = cdsapi.Client()

# Example: CAMS global reanalysis (EAC4) total column ozone (toy example)
c.retrieve(
    "cams-global-reanalysis-eac4",
    {"variable": "total_column_ozone", "date": "2025-08-01/2025-08-02", "time": ["00:00", "12:00"],
     "format": "grib"}, "cams_ozone.grib")

5) NASA Earthdata (LAADS DAAC / GES DISC; token/login)

NASA Earthdata [7] provides unified sign-on access to a wide range of Earth science data, including satellite aerosol and trace-gas products that are crucial for air-quality applications. Two key centres for atmospheric composition are:

  • LAADS DAAC (Level-1 and Atmosphere Archive and Distribution System DAAC), which hosts MODIS, VIIRS and other instrument products (e.g., AOD, cloud, fire, radiance).
  • GES DISC (Goddard Earth Sciences Data and Information Services Center), which serves model and satellite products such as the MERRA-2 reanalysis, OMI, TROPOMI, and related atmospheric datasets.

Most of these datasets are free to use but require a NASA Earthdata Login; downloads are authenticated either via HTTP basic auth (username/password stored in .netrc) or via a personal access token (PAT) in request headers.

Great for: MODIS/VIIRS AOD, MAIAC, TROPOMI trace-gas products.

How to register and get API/download access:

  1. Create a NASA Earthdata Login account at:
    https://urs.earthdata.nasa.gov
  2. Confirm your email and log in to your Earthdata profile.
  3. Under your profile, generate a personal access token (PAT). Save this token securely; you can use it in scripts via an Authorization: Bearer <token> header or in tools that support Earthdata tokens.
  4. For traditional wget/curl-based downloads, you can alternatively create a ~/.netrc file to store your Earthdata username and password, for example:
machine urs.earthdata.nasa.gov
login <your username>
password <your password>

Then set file permissions to user-only (chmod 600 ~/.netrc) so command-line tools can authenticate automatically.

  5. For LAADS DAAC products, go to https://ladsweb.modaps.eosdis.nasa.gov, log in with your Earthdata credentials, and use the Search & Download interface to build download URLs; you can copy the auto-generated wget/curl commands into your scripts.
  6. For GES DISC datasets, start from https://disc.gsfc.nasa.gov, choose a dataset (e.g., MERRA-2), and use the "Data Access" or "Subset/Get Data" tools. The site can generate script templates (Python, wget, etc.) that already include the right endpoints for authenticated access.

Once your Earthdata Login and token are set up, LAADS DAAC and GES DISC behave like standard HTTPS APIs: you can call them from Python (e.g., with requests, xarray + pydap/OPeNDAP, or s3fs for cloud buckets) using your credentials or token for authenticated, scriptable downloads.

# Downloads via HTTPS with Earthdata login.

# pip install requests
import requests
url = "https://ladsweb.modaps.eosdis.nasa.gov/archive/allData/6/MCD19A2/2025/214/MCD19A2.A2025214.h21v09.006.2025xxxxxx.hdf"

# Requires a valid token or login; we recommend using .netrc or requests.Session() with auth.
# See the NASA docs for token-based downloads; here we only illustrate the pattern:
# s = requests.Session(); s.auth = (USERNAME, PASSWORD); r = s.get(url)
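The commented pattern above can be fleshed out as a hedged sketch; the token variable, output filename, and granule URL below are placeholders you would replace with your own values:

import os
import requests

# Placeholder token read from the environment (generate a PAT in your Earthdata profile first).
token = os.getenv("EARTHDATA_TOKEN")
url = "https://ladsweb.modaps.eosdis.nasa.gov/archive/allData/..."  # placeholder: use a real file URL from the LAADS search interface

session = requests.Session()
session.headers.update({"Authorization": f"Bearer {token}"})

r = session.get(url, timeout=120)
r.raise_for_status()

with open("granule.hdf", "wb") as f:
    f.write(r.content)  # write the downloaded granule to a local file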

6) STAC catalogues (search satellites programmatically)

SpatioTemporal Asset Catalog (STAC) [8] is an open specification for describing geospatial assets, such as satellite scenes, tiles, and derived products, in a consistent, machine-readable way. Instead of manually browsing download portals, you query a STAC API with filters like time, bounding box, cloud cover, platform (e.g., Sentinel-2, Landsat-8, Sentinel-5P), or processing level, and get back JSON items with direct links to COGs, NetCDF, Zarr, or other assets.

Great for: discovering and streaming assets (COGs/NetCDF) without bespoke APIs; works well with Sentinel-5P, Landsat, Sentinel-2, and more.

How to register and get API access:
STAC itself is just a standard; access depends on the specific STAC API you use:

  • Many public STAC catalogues (e.g., demo or research endpoints) are fully open and require no registration; you can hit their /search endpoint directly with HTTP POST/GET.
  • Some cloud platforms that expose STAC (for example, commercial or large cloud providers) require you to create a free account and obtain credentials before you can read the underlying assets (e.g., blobs in S3/Blob storage), even though the STAC metadata is open.

A generic pattern you can follow is:

  1. Pick a STAC API endpoint for the satellite data you care about (often documented as something along the lines of https://<provider>/stac or …/stac/search).
  2. If the provider requires sign-up, create an account in their portal and obtain the API key or storage credentials they recommend (this might be a token, SAS URL, or cloud access role).
  3. Use a STAC client library in Python (for example, pystac-client) to search the catalogue:
# pip install pystac-client
from pystac_client import Client

api = Client.open("https://example.com/stac")
search = api.search(
    collections=["sentinel-2-l2a"],
    bbox=[102.4, 17.8, 103.0, 18.2],   # minx, miny, maxx, maxy
    datetime="2024-01-01/2024-01-31",
    query={"eo:cloud_cover": {"lt": 20}},
    )
items = list(search.get_items())
first_item = items[0]
assets = first_item.assets  # e.g., COGs, QA bands, metadata
  4. For each returned STAC item, follow the asset href links (often HTTPS URLs or cloud URIs like s3://…) and read them with the appropriate library (rasterio/xarray/zarr etc.); a hedged read sketch follows the example below. If credentials are needed, configure them via environment variables or your cloud SDK per the provider's instructions.

Once set up, STAC catalogues give you a uniform, programmatic way to search and retrieve satellite data across different providers, without rewriting your search logic every time you switch from one archive to another.

# pip install pystac-client planetary-computer rasterio
from pystac_client import Client
from shapely.geometry import box, mapping
import geopandas as gpd

catalog = Client.open("https://earth-search.aws.element84.com/v1")
aoi = mapping(box(-0.3, 5.5, 0.3, 5.9))  # bbox around Accra
search = catalog.search(collections=["sentinel-2-l2a"], intersects=aoi, limit=5)
items = list(search.get_items())
for it in items:
    print(it.id, list(it.assets.keys())[:5])   # e.g., "B04", "B08", "SCL", "visual"

It is preferable to use STAC where possible, as STAC APIs provide clean metadata, cloud-optimised assets, and easy filtering by time/space.
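To go one step beyond printing asset keys, the hedged sketch below streams a heavily downsampled preview of one asset with rasterio; it assumes the items list from the search above is non-empty and that the first item exposes a band under the asset key "B04", which varies by catalogue:

# pip install rasterio
import rasterio

# Assumes `items` comes from the pystac-client search above; "B04" (red band) is an assumed asset key.
asset_href = items[0].assets["B04"].href

with rasterio.open(asset_href) as src:
    print(src.crs, src.width, src.height)
    # Read a coarse overview so only a small amount of data is streamed over HTTP.
    preview = src.read(1, out_shape=(src.height // 50, src.width // 50))
    print(preview.shape, preview.dtype)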

7) Google Earth Engine (GEE; fast prototyping at scale)

Google Earth Engine [9] is a cloud-based geospatial analysis platform that hosts a large catalogue of satellite, climate, and land-surface datasets (e.g., MODIS, Landsat, Sentinel, reanalyses) and lets you process them at scale without managing your own infrastructure. You write short scripts in JavaScript or Python, and GEE handles the heavy lifting such as data access, tiling, reprojection, and parallel computation, making it ideal for fast prototyping, exploratory analyses, and teaching.

However, GEE itself is not open source: it is a proprietary, closed platform whose underlying codebase is not publicly available. This has implications for the open, reproducible workflows discussed in the first Air for Tomorrow blog [add link]:
 
Great for: testing fusion/downscaling over a city/region using petabyte-scale datasets.
 
How to register and get access

  1. Go to the Earth Engine sign-up page: https://earthengine.google.com
  2. Sign in with a Google account and complete the non-commercial sign-up form, describing your intended use (research, education, or personal, non-commercial projects).
  3. Once your account is approved, you can:
  • use the browser-based Code Editor to write JavaScript Earth Engine scripts; and
  • enable the Earth Engine API in Google Cloud and install the earthengine-api Python package (pip install earthengine-api) to run workflows from Python notebooks.
  4. When sharing your work, consider exporting key intermediate results (e.g., GeoTIFF/COG, NetCDF/Zarr) and documenting your processing steps in open-source code so that others can re-create the analysis without depending solely on GEE.

When used this way, Earth Engine becomes a powerful "quick laboratory" for testing ideas, which you can then harden into fully open, portable pipelines for production and long-term stewardship.

# pip install earthengine-api
import ee

ee.Initialize()  # first run: ee.Authenticate() in a console
s5p = (ee.ImageCollection('COPERNICUS/S5P/OFFL/L3_NO2')
       .select('NO2_column_number_density')
       .filterDate('2025-08-01', '2025-08-07')
       .mean())

print(s5p.getInfo()['bands'][0]['id'])

# Exporting and visualization happen inside GEE; you can sample to a grid then .getDownloadURL()
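As a follow-on sketch of that last comment (the region, scale, and format below are illustrative values, not taken from the original post), a small clip of the weekly mean can be requested through getDownloadURL:

# Hypothetical continuation: request a small GeoTIFF clip of the weekly NO2 mean.
region = ee.Geometry.Rectangle([102.4, 17.8, 103.0, 18.2])  # illustrative bbox around Vientiane
download_url = s5p.getDownloadURL({
    "region": region,
    "scale": 10000,        # metres per pixel; kept coarse so the request stays small
    "format": "GEO_TIFF",
})
print(download_url)  # open in a browser or fetch with requests to save the clip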

8) HIMAWARI

Himawari-8 and Himawari-9 are geostationary meteorological satellites operated by the Japan Meteorological Agency (JMA). Their Advanced Himawari Imager (AHI) provides multi-band visible, near-infrared and infrared imagery over East Asia and the western–central Pacific, with full-disk scans every 10 minutes and even faster refresh over target regions. This high-cadence view is extremely useful for tracking smoke plumes, dust, volcanic eruptions, convective storms and the diurnal evolution of clouds, exactly the kinds of processes that modulate near-surface air quality.
 
Great for: tracking diurnal haze/smoke plumes and fire events, producing high-frequency AOD to fill polar-orbit gaps, and rapid situational awareness for cities across SE/E Asia (via JAXA P-Tree L3 products).

How to access and register

Option A – Open archive via NOAA on AWS (no sign-up required)

  1. Browse the dataset description on the AWS Registry of Open Data: https://registry.opendata.aws/noaa-himawari/
  2. Himawari-8 and Himawari-9 imagery are hosted in public S3 buckets (s3://noaa-himawari8/ and s3://noaa-himawari9/). Because the buckets are world-readable, you can list or download files anonymously, for example:

aws s3 ls --no-sign-request s3://noaa-himawari9/ 

or access individual objects via HTTPS (e.g., https://noaa-himawari9.s3.amazonaws.com/…).

  3. For Python workflows, you can use libraries like s3fs, fsspec, xarray, or rasterio to stream data directly from these buckets without prior registration, keeping in mind the attribution guidance from JMA/NOAA when you publish results; a small anonymous-access sketch follows.
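For example, a minimal sketch of that anonymous access with s3fs (the bucket name comes from the registry entry above; the listing layout is whatever JMA/NOAA currently publish):

# pip install s3fs
import s3fs

# Anonymous (unsigned) access to the public NOAA Himawari-9 bucket.
fs = s3fs.S3FileSystem(anon=True)

# List a few top-level prefixes; inspect these before hard-coding any paths.
for entry in fs.ls("noaa-himawari9/")[:10]:
    print(entry)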

Option B – JAXA Himawari Monitor / P-Tree (research & education account)

  1. Go to the JAXA Himawari Monitor / P-Tree portal:
    https://www.eorc.jaxa.jp/ptree/
  2. Click User Registration / Account request and read the "Precautions" and "Terms of Use". Data access is limited to non-profit purposes such as research and education; commercial users are directed to the Japan Meteorological Business Support Center.
  3. Submit your email address in the account request form. You will receive a temporary acceptance email, then a link to complete your user information. After a manual review, JAXA enables your access and notifies you once you can download Himawari Standard Data and geophysical parameter products.
  4. Once approved, you can log in and download near-real-time and archived Himawari data via the P-Tree FTP/HTTP services, following JAXA's guidance on non-redistribution and citation.

In practice, a common pattern is to use the NOAA/AWS buckets for open, scriptable access to raw imagery, and the JAXA P-Tree products when you need value-added parameters (e.g., cloud or aerosol properties) and are working within non-profit research or educational projects.

# open the downloaded file
!pip install xarray netCDF4
!pip install rasterio polars_h3
!pip install geopandas pykrige
!pip install polars==1.25.2
!pip install dask[complete] rioxarray h3==3.7.7
!pip install h3ronpy==0.21.1
!pip install geowrangler
# Himawari using – JAXA Himawari Monitor / P-Tree
# create your account here and use the username and password sent by email - https://www.eorc.jaxa.jp/ptree/registration_top.html

user = ''  # enter the username
password = ''  # enter the password
from ftplib import FTP
from pathlib import Path
import rasterio
from rasterio.transform import from_origin
import xarray as xr
import numpy as np
import os
import matplotlib.pyplot as plt


def get_himawari_ftp_past_2_days(user, password):

    # FTP connection details
    ftp = FTP('ftp.ptree.jaxa.jp')
    ftp.login(user=user, passwd=password)

    # check the directory content: /pub/himawari/L2/ARP/031/
    # details of the AOD directory here: https://www.eorc.jaxa.jp/ptree/documents/README_HimawariGeo_en.txt

    overall_path = "/pub/himawari/L3/ARP/031/"
    directories = overall_path.strip("/").split("/")

    for directory in directories:
      ftp.cwd(directory)

    # List files in the target directory
    date_month_files = ftp.nlst()

    # order files ascending
    date_month_files.sort(reverse=False)
    print("Files in target directory:", date_month_files)

    # get a list of all the month/day folders within the "/pub/himawari/L3/ARP/031/" path for the past 2 months
    limited_months_list = date_month_files[-2:]

    i = 0
    # for each month in limited_months_list, list all the days within it
    for month in limited_months_list:
      ftp.cwd(month)
      date_day_files = ftp.nlst()
      date_day_files.sort(reverse=False)


      # combine each element of the date_day_files list with the month: month + "/" + date_day_file
      list_combined_days_month_inter = [month + "/" + date_day_file for date_day_file in date_day_files]
      if i == 0:
        list_combined_days_month = list_combined_days_month_inter
        i = i + 1
      else:
        list_combined_days_month = list_combined_days_month + list_combined_days_month_inter
      ftp.cwd("..")

    # remove all elements containing daily or monthly products from list_combined_days_month
    list_combined_days_month = [item for item in list_combined_days_month if 'daily' not in item and 'monthly' not in item]

    # get the list of days we want to download: in our case the last 2 days - for NRT
    limited_list_combined_days_month = list_combined_days_month[-2:]


    for month_day_date in limited_list_combined_days_month:
      # navigate to the relevant directory
      ftp.cwd(month_day_date)
      print(f"directory: {month_day_date}")

      # get the list of the hourly files within each directory
      date_hour_files = ftp.nlst()
      !mkdir -p ./raw_data/{month_day_date}

      # for each hourly file in the list
      for date_hour_file in date_hour_files:
        target_file_path = f"./raw_data/{month_day_date}/{date_hour_file}"
        # Download the target file - only if it does not already exist

        if not os.path.exists(target_file_path):
            with open(target_file_path, "wb") as local_file:
              ftp.retrbinary(f"RETR {date_hour_file}", local_file.write)
              print(f"Downloaded {date_hour_file} successfully!")
        else:
            print(f"File already exists: {date_hour_file}")



      print("--------------")
      # go back 2 steps in the ftp tree
      ftp.cwd("..")
      ftp.cwd("..")
def transform_to_tif():
    # get the list of files in the raw_data folder
    month_file_list = os.listdir("./raw_data")
    month_file_list

    # order month_file_list
    month_file_list.sort(reverse=False)

    nb_errors = 0
    # get the list of daily folders for the past 2 months only

    for month_file in month_file_list[-2:]:
        print(f"-----------------------------------------")
        print(f"Month considered: {month_file}")
        date_file_list = os.listdir(f"./raw_data/{month_file}")
        date_file_list.sort(reverse=False)

        # get the list of files for each daily folder

        for date_file in date_file_list[-2:]:
            print(f"---------------------------")
            print(f"Day considered: {date_file}")
            hour_file_list = os.listdir(f"./raw_data/{month_file}/{date_file}")
            hour_file_list.sort(reverse=False)

            # process each hourly file into a tif file and transform it into an h3 processed dataframe
            for hour_file in hour_file_list:
                file_path = f"./raw_data/{month_file}/{date_file}/{hour_file}"
                hour_file_tif = hour_file.replace(".nc", ".tif")
                output_tif = f"./tif/{month_file}/{date_file}/{hour_file_tif}"
                if os.path.exists(output_tif):
                   print(f"File already exists: {output_tif}")
                else:

                   try:
                      dataset = xr.open_dataset(file_path, engine='netcdf4')
                   except:
                      # go to the next hour_file
                      print(f"error opening {hour_file} file - skipping ")
                      nb_errors = nb_errors + 1
                      continue

                   # Access a specific variable
                   variable_name = list(dataset.data_vars.keys())[1]  # Merged AOT product
                   data = dataset[variable_name]

                   # Plot the data (if it is 2D and suitable)
                   plt.figure()
                   data.plot()
                   plt.title(f'{date_file}')
                   plt.show()

                   # Extract metadata (replace with actual coordinates from your data if available)
                   lon = dataset['longitude'] if 'longitude' in dataset.coords else None
                   lat = dataset['latitude'] if 'latitude' in dataset.coords else None

                   # Handle missing lat/lon (example assumes an evenly spaced grid)
                   if lon is None or lat is None:
                        lon_start, lon_step = -180, 0.05  # Example values
                        lat_start, lat_step = 90, -0.05  # Example values
                        lon = xr.DataArray(lon_start + lon_step * np.arange(data.shape[-1]), dims=['x'])
                        lat = xr.DataArray(lat_start + lat_step * np.arange(data.shape[-2]), dims=['y'])

                   # Define the affine transform for georeferencing
                   transform = from_origin(lon.min().item(), lat.max().item(), abs(lon[1] - lon[0]).item(), abs(lat[0] - lat[1]).item())

                   # Save to GeoTIFF
                   !mkdir -p ./tif/{month_file}/{date_file}

                   with rasterio.open(
                   output_tif,
                   'w',
                   driver='GTiff',
                   height=data.shape[-2],
                   width=data.shape[-1],
                   count=1,  # Number of bands
                   dtype=data.dtype.name,
                   crs='EPSG:4326',  # Coordinate Reference System (e.g., WGS84)
                   transform=transform
                   ) as dst:

                        dst.write(data.values, 1)  # Write the data to band 1
                   print(f"Saved {output_tif} successfully!")
                   print(f"{nb_errors} error(s) ")
get_himawari_ftp_past_2_days(user, password)
transform_to_tif()

9) NASA — FIRMS [Special Highlight]

NASA's Fire Information for Resource Management System (FIRMS) [10] provides near-real-time information on active fires and thermal anomalies detected by instruments such as MODIS and VIIRS. It offers global coverage with low latency (on the order of minutes to hours), supplying attributes such as fire radiative power, confidence, and acquisition time. FIRMS is widely used for wildfire monitoring, agricultural burning, forest management, and as a proxy input for air-quality and smoke dispersion modelling.
 
Great for: pinpointing fire hotspots that drive AQ spikes, monitoring plume sources and fire-line progression, tracking crop-residue/forest burns, and triggering rapid response. Easy access via CSV/GeoJSON/Shapefile and map tiles/API, with 24–72 h rolling feeds and full archives for seasonal analysis.

How to register and get API access

  1. Create a free NASA Earthdata Login account at:
    https://urs.earthdata.nasa.gov
  2. Confirm your email and sign in with your new credentials.
  3. Go to the FIRMS site you plan to use.
  4. Click Login (top right) and authenticate with your Earthdata username and password. Once logged in, you can:
  • customise map views and download options from the web interface, and
  • generate or use FIRMS Web Services/API URLs that honour your authenticated session.
  5. For scripted access, you can call the FIRMS download or web service endpoints (e.g., GeoJSON, CSV) using standard HTTP tools (e.g., curl, requests in Python). If an endpoint requires authentication, supply your Earthdata credentials via a .netrc file or session cookies, as you would for other Earthdata services.

In practice, FIRMS is a convenient way to pull recent fire locations into an air-quality workflow: you can fetch daily or hourly fire detections for a region, convert them to a GeoDataFrame, and then intersect them with wind fields, population grids, or sensor networks to understand potential smoke impacts.

# FIRMS
!pip install geopandas rtree shapely
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
import numpy as np
import matplotlib.pyplot as plt
import rtree

# get the boundaries of Thailand
boundaries_country = gpd.read_file('https://github.com/wmgeolab/geoBoundaries/raw/fcccfab7523d4d5e55dfc7f63c166df918119fd1/releaseData/gbOpen/THA/ADM0/geoBoundaries-THA-ADM0.geojson')
boundaries_country.plot()

# Real-time data source: https://firms.modaps.eosdis.nasa.gov/active_fire/
# Past 7 days links:
modis_7d_url = "https://firms.modaps.eosdis.nasa.gov/data/active_fire/modis-c6.1/csv/MODIS_C6_1_SouthEast_Asia_7d.csv"
suomi_7d_url = "https://firms.modaps.eosdis.nasa.gov/data/active_fire/suomi-npp-viirs-c2/csv/SUOMI_VIIRS_C2_SouthEast_Asia_7d.csv"
j1_7d_url = "https://firms.modaps.eosdis.nasa.gov/data/active_fire/noaa-20-viirs-c2/csv/J1_VIIRS_C2_SouthEast_Asia_7d.csv"
j2_7d_url = "https://firms.modaps.eosdis.nasa.gov/data/active_fire/noaa-21-viirs-c2/csv/J2_VIIRS_C2_SouthEast_Asia_7d.csv"
urls = [modis_7d_url, suomi_7d_url, j1_7d_url, j2_7d_url]

# Create an empty GeoDataFrame to store the combined data
gdf = gpd.GeoDataFrame()

for url in urls:
    df = pd.read_csv(url)

    # Create a geometry column from latitude and longitude
    geometry = [Point(xy) for xy in zip(df['longitude'], df['latitude'])]
    gdf_temp = gpd.GeoDataFrame(df, crs="EPSG:4326", geometry=geometry)

    # Concatenate the temporary GeoDataFrame to the main GeoDataFrame
    gdf = pd.concat([gdf, gdf_temp], ignore_index=True)

# Filter to keep only fires within the country boundaries
gdf = gpd.sjoin(gdf, boundaries_country, how="inner", predicate="within")

# Display fires on a map
frp = gdf["frp"].astype(float)
fig, ax = plt.subplots(figsize=(9, 9))
boundaries_country.plot(ax=ax, facecolor="none", edgecolor="0.3", linewidth=0.8)
gdf.plot(ax=ax, markersize=frp, color="red", alpha=0.55)
ax.set_title("Fires within country boundaries (bubble size = Fire Radiative Power)")
ax.set_axis_off()
plt.show()

Data types you will meet (and how to read them right)

Air-quality work rarely lives in a single, tidy CSV, so it helps to know the file types you will meet. You will move between multidimensional model outputs (NetCDF/GRIB/Zarr), satellite rasters (COG/GeoTIFF), point measurements (CSV/Parquet/GeoParquet), and web-friendly formats (JSON/GeoJSON), often in the same notebook.

This section is a quick field guide to those formats and how to open them without getting stuck.

There is no need to memorise any of this, so feel free to skim the list once, then come back when you hit an unfamiliar file extension in the wild.

  1. NetCDF4 / HDF5 (self-describing scientific arrays): Widely used for reanalyses, satellite products, and models. Rich metadata, multi-dimensional (time, level, lat, lon). Typical extensions: .nc, .nc4, .h5, .hdf5

Read:

# pip install xarray netCDF4

import xarray as xr
ds = xr.open_dataset("modis_aod_2025.nc")
ds = ds.sel(time=slice("2025-08-01", "2025-08-07"))
print(ds)
  2. Cloud-Optimised GeoTIFF (COG): Raster format tuned for HTTP range requests (stream just what you need). Common for satellite imagery and gridded products. Typical extensions: .tif, .tiff

Read:

# pip install rasterio

import rasterio
from rasterio.windows import from_bounds
with rasterio.open("https://example-bucket/no2_mean_2025.tif") as src:
    window = from_bounds(*(-0.3, 5.5, 0.3, 5.9), src.transform)
    arr = src.read(1, window=window)
  3. JSON (nested) & GeoJSON (features + geometry): Great for APIs and lightweight geospatial data. GeoJSON uses WGS84 (EPSG:4326) by default. Typical extensions: .json, .jsonl, .ndjson, .geojson, .geojsonl, .ndgeojson

Read:

# pip install geopandas

import geopandas as gpd
gdf = gpd.read_file("points.geojson")  # columns + geometry
gdf = gdf.set_crs(4326)                # ensure WGS84
  4. GRIB2 (meteorology, model outputs): Compact, tiled; often used by CAMS/ECMWF/NWP. Typical extensions: .grib2, .grb2, .grib, .grb. In practice, data providers often add compression suffixes too, e.g. .grib2.gz or .grb2.bz2.

Read:

# pip install xarray cfgrib

import xarray as xr
ds = xr.open_dataset("cams_ozone.grib", engine="cfgrib")
  5. Parquet & GeoParquet (columnar, compressed): Best for big tables: fast column selection, predicate pushdown, partitioning (e.g., by date/city). GeoParquet adds a standard for geometries. Typical extensions: .parquet, .parquet.gz

Read/Write:

# pip install pandas pyarrow geopandas geoparquet

import pandas as pd, geopandas as gpd
df = pd.read_parquet("openaq_accra_2025.parquet")   # columns only

# Convert a GeoDataFrame -> GeoParquet
gdf = gpd.read_file("points.geojson")
gdf.to_parquet("points.geoparquet")  # preserves geometry & CRS
  6. CSV/TSV (text tables): Simple, universal. Weak at large scale (slow I/O, no schema), no geometry. Typical extensions: .csv, .tsv (also sometimes .tab, less common)

Read:

# pip install pandas

import pandas as pd
df = pd.read_csv("measurements.csv", parse_dates=["datetime"], dtype={"site_id": "string"})
  7. Zarr (chunked, cloud-native): Ideal for analysis in the cloud with parallel reads (works great with Dask). Typical extension: .zarr (usually a directory / store ending in .zarr; sometimes packaged as .zarr.zip)

Read:

# pip install xarray zarr s3fs

import xarray as xr
ds = xr.open_zarr("s3://bucket/cams_eac4_2025.zarr", consolidated=True)

Note: Shapefile (legacy vector): it works, but it is brittle (many files, a 10-character field-name limit). It is a legacy format; prefer alternatives such as GeoPackage or GeoParquet (a minimal conversion sketch follows).
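As a small illustration (the file names are hypothetical), converting a legacy shapefile to GeoPackage with geopandas is a one-liner:

# pip install geopandas
import geopandas as gpd

# Hypothetical file names: read a legacy shapefile and write a single-file GeoPackage.
gdf = gpd.read_file("stations.shp")
gdf.to_file("stations.gpkg", driver="GPKG")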

It is important to choose the right geospatial (or scientific) file format, as it is not just a storage decision: it directly affects how quickly you can read data, tool compatibility, how easily you can share it, and how well it scales from a desktop workflow to cloud-native processing. The following table (Table 1) provides a practical "format-to-task" cheat sheet: for each common need (from quick API dumps to cloud-scale arrays and web mapping), it lists the most suitable format, the extensions you will typically encounter, and the core reason that format is a good fit. It can be used as a default starting point when designing pipelines, publishing datasets, or deciding what to download from an external repository.

Need                                    | Best Bet           | Typical Extension                  | Why
Human-readable logs or quick API dumps  | CSV/JSON           | .csv, .json (also .jsonl, .ndjson) | Ubiquitous, easy to inspect
Big tables (millions of rows)           | Parquet/GeoParquet | .parquet                           | Fast scans, column pruning, and partitioning
Large rasters over HTTP                 | COG                | .tif, .tiff                        | Range requests; no full download
Multi-dimensional scientific data       | NetCDF4/HDF5       | .nc, .nc4, .h5, .hdf5              | Self-describing, units/attrs
Meteorological model outputs            | GRIB2              | .grib2, .grb2, .grib, .grb         | Compact, widely supported by wx tools
Cloud-scale arrays                      | Zarr               | .zarr                              | Chunked + parallel; cloud-native
Exchangeable vector file                | GeoPackage         | .gpkg                              | Single file; robust
Web mapping geometries                  | GeoJSON            | .geojson, .geojsonl, .ndgeojson    | Simple; native to web stacks
Table 1: Picking the right format for the job

Tip: An interesting talk on STAC and data types (especially GeoParquet): https://github.com/GSA/gtcop-wiki/wiki/June-2025:-GeoParquet,-Iceberg-and-Cloud%E2%80%90Native-Spatial-Data-Infrastructures

Several open STAC catalogues are now available, including public endpoints for optical, radar, and atmospheric products (for example, Landsat and Sentinel imagery via providers such as Element 84's Earth Search or Microsoft's Planetary Computer). STAC makes it much easier to script "find and download all scenes for this polygon and time range" and to integrate different datasets into the same workflow.

Conclusion — from "where" the data lives to "how" you use it

Figure 3: Creating exposure maps from hotspots. © UNICEF/UNI724381/Kongchan Phi. All rights reserved.

Air for Tomorrow: We started with the question "What are these kids breathing today?" This post provides a practical path and tools to help you answer it. You now know where open air quality data resides, including regulatory networks, community sensors, satellite measurements, and reanalyses. You also understand what these files are (GeoJSON, Parquet/GeoParquet, NetCDF/HDF5, COG, GRIB, Zarr) and how to retrieve them with compact, reproducible snippets. The goal goes beyond simply downloading them; it is to produce defensible, fast, and shareable analyses that hold up tomorrow.

You can assemble a credible local picture in hours, not weeks. From fire hotspots (Figure 2) to school-route exposure (Figure 1), you can create exposure maps (Figure 3).

Up next: We will showcase an actual air quality model developed by us at the UNICEF Country Office of Lao PDR with the UNICEF EAPRO Frontier Data Team, and walk through an open, end-to-end model pipeline. Where ground-level air quality data streams are available, we will cover how feature engineering, bias correction, normalisation, and a model can be developed into an actionable output that a region can use tomorrow morning.

Contributors: Prithviraj Pramanik, AQAI; Hugo Ruiz Verastegui, Anthony Mockler, Judith Hanan, Frontier Data Lab; Risdianto Irawan, UNICEF EAPRO; Soheib Abdalla, Andrew Dunbrack, UNICEF Lao PDR Country Office; Halim Jun, Daniel Alvarez, Shane O'Connor, UNICEF Office of Innovation

Use Cases, Architecture & Buying Recommendations


Introduction – What Makes Nvidia GH200 the Star of 2026?

Fast Abstract: What’s the Nvidia GH200 and why does it matter in 2026? – The Nvidia GH200 is a hybrid superchip that merges a 72‑core Arm CPU (Grace) with a Hopper/H200 GPU utilizing NVLink‑C2C. This integration creates as much as 624 GB of unified reminiscence accessible to each CPU and GPU, enabling reminiscence‑sure AI workloads like lengthy‑context LLMs, retrieval‑augmented era (RAG) and exascale simulations. In 2026, as fashions develop bigger and extra advanced, the GH200’s reminiscence‑centric design delivers efficiency and value effectivity not achievable with conventional GPU playing cards. Clarifai presents enterprise‑grade GH200 internet hosting with good autoscaling and cross‑cloud orchestration, making this know-how accessible for builders and companies.

Synthetic intelligence is evolving at breakneck pace. Mannequin sizes are growing from hundreds of thousands to trillions of parameters, and generative functions equivalent to retrieval‑augmented chatbots and video synthesis require large key–worth caches and embeddings. Conventional GPUs just like the A100 or H100 present excessive compute throughput however can turn into bottlenecked by reminiscence capability and information motion. Enter the Nvidia GH200, usually nicknamed the Grace Hopper superchip. As an alternative of connecting a CPU and GPU by way of a gradual PCIe bus, the GH200 fuses them on the identical bundle and hyperlinks them by NVLink‑C2C—a excessive‑bandwidth, low‑latency interconnect that delivers 900 GB/s of bidirectional bandwidth. This structure permits the GPU to entry the CPU’s reminiscence straight, leading to a unified reminiscence pool of as much as 624 GB (when combining the 96 GB or 144 GB HBM on the GPU with 480 GB LPDDR5X on the CPU).

This information presents an in depth have a look at the GH200: its structure, efficiency, best use instances, deployment fashions, comparability to different GPUs (H100, H200, B200), and sensible steerage on when and the way to decide on it. Alongside the way in which we are going to spotlight Clarifai’s compute options that leverage GH200 and supply greatest practices for deploying reminiscence‑intensive AI workloads.

Fast Digest: How This Information Is Structured

  • Understanding the GH200 Structure – We look at how the hybrid CPU–GPU design and unified reminiscence system work, and why HBM3e issues.
  • Benchmarks & Price Effectivity – See how GH200 performs in inference and coaching in contrast with H100/H200, and the impact on value per token.
  • Use Instances & Workload Match – Be taught which AI and HPC workloads profit from the superchip, together with RAG, LLMs, graph neural networks and exascale simulations.
  • Deployment Fashions & Ecosystem – Discover on‑premises DGX techniques, hyperscale cloud situations, specialist GPU clouds, and Clarifai’s orchestration options.
  • Choice Framework – Perceive when to decide on GH200 vs H100/H200 vs B200/Rubin based mostly on reminiscence, bandwidth, software program and funds.
  • Challenges & Future Traits – Contemplate limitations (ARM software program, energy, latency) and sit up for HBM3e, Blackwell, Rubin and new supercomputers.

Let’s dive in.


GH200 Structure and Reminiscence Improvements

Fast Abstract: How does the GH200’s structure differ from conventional GPUs? – In contrast to standalone GPU playing cards, the GH200 integrates a 72‑core Grace CPU and a Hopper/H200 GPU on a single module. The 2 chips talk by way of NVLink‑C2C delivering 900 GB/s bandwidth. The GPU consists of 96 GB HBM3 or 144 GB HBM3e, whereas the CPU gives 480 GB LPDDR5X. NVLink‑C2C permits the GPU to straight entry CPU reminiscence, making a unified reminiscence pool of as much as 624 GB. This eliminates expensive information transfers and is vital to the GH200’s reminiscence‑centric design.

Hybrid CPU–GPU Fusion

At its core, the GH200 combines a Grace CPU and a Hopper GPU. The CPU options 72 Arm Neoverse V2 cores (or 72 Grace cores), delivering excessive reminiscence bandwidth and power effectivity. The GPU relies on the Hopper structure (used within the H100) however could also be upgraded to the H200 in newer revisions, including quicker HBM3e reminiscence. NVLink‑C2C is the key sauce: a cache‑coherent interface enabling each chips to share reminiscence coherently at 900 GB/s – roughly 7× quicker than PCIe Gen5. This design makes the GH200 successfully an enormous APU or system‑on‑chip tailor-made for AI.

Unified Reminiscence Pool

Conventional GPU servers depend on discrete reminiscence swimming pools: CPU DRAM and GPU HBM. Information have to be copied throughout the PCIe bus, incurring latency and overhead. The GH200’s unified reminiscence eliminates this barrier. The Grace CPU brings 480 GB of LPDDR5X reminiscence with bandwidth of 546 GB/s, whereas the Hopper GPU consists of 96 GB HBM3 delivering 4 000 GB/s bandwidth. The upcoming HBM3e variant will increase reminiscence capability to 141–144 GB and boosts bandwidth by over 25 %. Mixed with NVLink‑C2C, this gives a shared reminiscence pool of as much as 624 GB, enabling the GPU to cache large datasets and key–worth caches for LLMs with out repeatedly fetching from CPU reminiscence. NVLink can also be scalable: NVL2 pairs two superchips to create a node with 288 GB HBM and 10 TB/s bandwidth, and the NVLink change system can join 256 superchips to behave as one large GPU with 1 exaflop efficiency and 144 TB unified reminiscence.

HBM3e and Rubin Platform

The GH200 began with HBM3 however is already evolving. The HBM3e revision provides 144 GB of HBM for the GPU, elevating efficient reminiscence capability by round 50 % and growing bandwidth from 4 000 GB/s to about 4.9 TB/s. This improve helps massive fashions retailer extra key–worth pairs and embeddings solely in on‑chip reminiscence. Trying forward, Nvidia’s Rubin platform (introduced 2025) will introduce a brand new CPU with 88 Olympus cores, 1.8 TB/s NVLink‑C2C bandwidth and 1.5 TB LPDDR5X reminiscence, doubling reminiscence capability over Grace. Rubin may even help NVLink 6 and NVL72 rack techniques that scale back inference token value by 10× and coaching GPU rely by 4× in contrast with Blackwell—an indication that reminiscence‑centric design will proceed to evolve.

Skilled Insights

  • Unified reminiscence is a paradigm shift – By exposing GPU reminiscence as a CPU NUMA node, NVLink‑C2C eliminates the necessity for specific information copying and permits CPU code to entry HBM straight. This simplifies programming and accelerates reminiscence‑sure duties.
  • HBM3e vs HBM3 – The 50 % enhance in capability and 25 % enhance in bandwidth of HBM3e considerably extends the dimensions of fashions that may be served on a single chip, pushing the GH200 into territory beforehand reserved for multi‑GPU clusters.
  • Scalability by way of NVLink change – Connecting a whole lot of superchips by way of NVLink change ends in a single logical GPU with terabytes of shared reminiscence—essential for exascale techniques like Helios and JUPITER.
  • Grace vs Rubin – Whereas Grace presents 72 cores and 480 GB reminiscence, Rubin will ship 88 cores and as much as 1.5 TB reminiscence with NVLink 6, hinting that future AI workloads might require much more reminiscence and bandwidth.

Efficiency Benchmarks & Price Effectivity

Fast Abstract: How does GH200 carry out relative to H100/H200, and what does this imply for value? – Benchmarks reveal that the GH200 delivers 1.4×–1.8× greater MLPerf inference efficiency per accelerator than the H100. In sensible checks on Llama 3 fashions, GH200 achieved 7.6× greater throughput and lowered value per token by 8× in contrast with H100. Clarifai reviews a 17 % efficiency acquire over H100 of their MLPerf outcomes. These features stem from unified reminiscence and NVLink‑C2C, which scale back latency and allow bigger batches.

MLPerf and Vendor Benchmarks

In Nvidia’s MLPerf Inference v4.1 outcomes, the GH200 delivered as much as 1.4× extra efficiency per accelerator than the H100 on generative AI duties. When configured in NVL2, two superchips achieved 3.5× extra reminiscence and 3× extra bandwidth than a single H100, translating into higher scaling for big fashions. Clarifai’s inner benchmarking confirmed a 17 % throughput enchancment over H100 for MLPerf duties.

Actual‑World Inference (LLM and RAG)

In a extensively shared weblog put up, Lambda AI in contrast GH200 to H100 for single‑node Llama 3.1 70B inference. GH200 delivered 7.6× greater throughput and 8× decrease value per token than H100, due to the flexibility to dump key–worth caches to CPU reminiscence. Baseten ran related experiments with Llama 3.3 70B and located that GH200 outperformed H100 by 32 % as a result of the reminiscence pool allowed bigger batch sizes. Nvidia’s technical weblog on RAG functions confirmed that GH200 gives 2.7×–5.7× speedups in contrast with A100 throughout embedding era, index construct, vector search and LLM inference.

Price‑Per‑Hour & Cloud Pricing

Price is a vital issue. An evaluation of GPU rental markets discovered that GH200 situations value $4–$6 per hour on hyperscalers, barely greater than H100 however with improved efficiency, whereas specialist GPU clouds typically supply GH200 at aggressive charges. Decentralised marketplaces might permit cheaper entry however usually restrict options. Clarifai’s compute platform makes use of good autoscaling and GPU fractioning to optimise useful resource utilisation, lowering value per token additional.
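
As a rough illustration of why throughput, not the hourly price alone, drives inference economics, here is a back-of-the-envelope sketch; the prices and token rates are assumed placeholders, not measured benchmarks.

# Cost per million tokens = hourly price / tokens generated per hour.
def cost_per_million_tokens(price_per_hour_usd: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour_usd / tokens_per_hour * 1_000_000

# Illustrative only: a pricier instance can still be cheaper per token if larger
# batches (enabled by unified memory) push throughput high enough.
print(cost_per_million_tokens(price_per_hour_usd=4.5, tokens_per_second=1200))  # hypothetical GH200
print(cost_per_million_tokens(price_per_hour_usd=3.0, tokens_per_second=250))   # hypothetical H100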

Reminiscence‑Certain vs Compute‑Certain Workloads

Whereas GH200 shines for reminiscence‑sure duties, it doesn’t at all times beat H100 for compute‑sure kernels. Some compute‑intensive kernels saturate the GPU’s compute items and aren’t restricted by reminiscence bandwidth, so the efficiency benefit shrinks. Fluence’s information notes that GH200 shouldn’t be the fitting selection for easy single‑GPU coaching or compute‑solely duties. In such instances, H100 or H200 may ship related or higher efficiency at decrease value.

Skilled Insights

  • Price per token issues – Inference value isn’t nearly GPU worth; it’s about throughput. GH200’s means to make use of bigger batches and retailer key–worth caches on CPU reminiscence drastically cuts value per token.
  • Batch measurement is the important thing – Bigger unified reminiscence permits greater batches and reduces the overhead of reloading contexts, resulting in large throughput features.
  • Stability compute and reminiscence – For compute‑heavy duties like CNN coaching or matrix multiplications, H100 or H200 might suffice. GH200 is focused at reminiscence‑sure workloads, so select accordingly.

Use Instances and Workload Match

Fast Abstract: Which workloads profit most from GH200? – GH200 excels in massive language mannequin inference and coaching, retrieval‑augmented era (RAG), multimodal AI, vector search, graph neural networks, advanced simulations, video era, and scientific HPC. Its unified reminiscence permits storing massive key–worth caches and embeddings in RAM, enabling quicker response occasions and bigger context home windows. Exascale supercomputers like JUPITER make use of tens of 1000’s of GH200 chips to simulate local weather and physics at unprecedented scale.

Massive Language Fashions and Chatbots

Fashionable LLMs equivalent to Llama 3, Llama 2, GPT‑J and different 70B+ parameter fashions require storing gigabytes of weights and key–worth caches. GH200’s unified reminiscence helps as much as 624 GB of accessible reminiscence, which means that lengthy context home windows (128k tokens or extra) will be served with out swapping to disk. Nvidia’s weblog on multiturn interactions exhibits that offloading KV caches to CPU reminiscence reduces time‑to‑first token by as much as 14× and improves throughput in contrast with x86‑H100 servers. This makes GH200 best for chatbots requiring actual‑time responses and deep context.
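
To see why a 624 GB pool matters at long contexts, here is a rough KV-cache sizing sketch using a common approximation (not a vendor formula); the 70B-class configuration values are assumptions, so check your actual model config.

# KV cache bytes ~ 2 (K and V) * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    total = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value
    return total / 1e9

# Assumed 70B-class settings at a 128k context with batch size 8:
print(kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=128_000, batch=8))
# => roughly hundreds of GB of cache on top of the weights, which exceeds any
#    single GPU's HBM but can still fit in GH200's unified memory.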

Retrieval‑Augmented Era (RAG)

RAG pipelines combine massive language fashions with vector databases to fetch related data. This requires producing embeddings, constructing vector indices and performing similarity search. Nvidia’s RAG benchmark exhibits GH200 achieves 2.7× quicker embedding era, 2.9× quicker index construct, 3.3× quicker vector search, and 5.7× quicker LLM inference in comparison with A100. The power to maintain vector databases in unified reminiscence reduces information motion and improves latency. Clarifai’s RAG APIs can run on GH200 to deploy chatbots with area‑particular information and summarisation capabilities.
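
A stripped-down sketch of the index-and-search step, assuming FAISS (the faiss-cpu package) is installed; random vectors stand in for real embeddings so the snippet stays self-contained.

import numpy as np
import faiss

dim, n_docs = 768, 10_000
doc_embeddings = np.random.rand(n_docs, dim).astype("float32")  # placeholder embeddings
query_embedding = np.random.rand(1, dim).astype("float32")

index = faiss.IndexFlatL2(dim)   # exact search; swap in an ANN index at larger scale
index.add(doc_embeddings)        # on GH200 a large index can sit in the unified pool
distances, ids = index.search(query_embedding, 5)
print(ids[0])                    # document ids to pull into the LLM prompt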

Multimodal AI and Video Era

The GH200’s reminiscence capability additionally advantages multimodal fashions (textual content + picture + video). Fashions like VideoPoet or diffusion‑based mostly video synthesizers require storing frames and cross‑modal embeddings. GH200’s reminiscence can maintain longer sequences and unify CPU and GPU reminiscence, accelerating coaching and inference. That is particularly precious for firms engaged on video era or massive‑scale picture captioning.

Graph Neural Networks and Suggestion Techniques

Massive recommender techniques and graph neural networks deal with billions of nodes and edges, usually requiring terabytes of reminiscence. Nvidia’s press launch on the DGX GH200 emphasises that NVLink change mixed with a number of superchips allows 144 TB of shared reminiscence for coaching advice techniques. This reminiscence capability is essential for fashions like Deep Studying Suggestion Mannequin 3 (DLRM‑v3) or GNNs utilized in social networks and information graphs. GH200 can drastically scale back coaching time and enhance scaling.

Scientific HPC and Exascale Simulations

Outdoors AI, the GH200 performs a job in scientific HPC. The European JUPITER supercomputer, anticipated to exceed 90 exaflops, employs 24 000 GH200 superchips interconnected by way of InfiniBand, with every node utilizing 288 Arm cores and 896 GB of reminiscence. The excessive reminiscence and compute density speed up local weather fashions, physics simulations and drug discovery. Equally, the Helios and DGX GH200 techniques join a whole lot of superchips by way of NVLink switches to type unified supernodes with exascale efficiency.

Skilled Insights

  • RAG is reminiscence‑sure – RAG workloads usually fail on smaller GPUs attributable to restricted reminiscence for embeddings and indices; GH200 solves this by providing unified reminiscence and close to‑zero copy entry.
  • Video era wants massive temporal context – GH200’s reminiscence allows storing a number of frames and have maps for prime‑decision video synthesis, lowering I/O overhead.
  • Graph workloads thrive on reminiscence bandwidth – Analysis on GNN coaching exhibits GH200 gives 4×–7× speedups for graph neural networks in contrast with conventional GPUs, due to its reminiscence capability and NVLink community.

Deployment Choices and Ecosystem

Fast Abstract: The place are you able to entry GH200 in the present day? – GH200 is on the market by way of on‑premises DGX techniques, cloud suppliers like AWS, Azure and Google Cloud, specialist GPU clouds (Lambda, Baseten, Fluence) and decentralised marketplaces. Clarifai presents enterprise‑grade GH200 internet hosting with options like good autoscaling, GPU fractioning and cross‑cloud orchestration. NVLink change techniques permit a number of superchips to behave as a single GPU with large shared reminiscence.

On‑Premise DGX Techniques

Nvidia’s DGX GH200 makes use of NVLink change to attach as much as 256 superchips, delivering 1 exaflop of efficiency and 144 TB unified reminiscence. Organisations like Google, Meta and Microsoft had been early adopters and plan to make use of DGX GH200 techniques for big mannequin coaching and AI analysis. For enterprises with strict information‑sovereignty necessities, DGX bins supply most management and excessive‑pace NVLink interconnects.

Hyperscaler Cases

Main cloud suppliers now supply GH200 situations. On AWS, Azure and Google Cloud, you’ll be able to lease GH200 nodes at roughly $4–$6 per hour. Pricing varies relying on area and configuration; the unified reminiscence reduces the necessity for multi‑GPU clusters, doubtlessly reducing general prices. Cloud situations are sometimes obtainable in restricted areas attributable to provide constraints, so early reservation is advisable.

Specialist GPU Clouds and Decentralised Markets

Firms like Lambda Cloud, Baseten and Fluence present GH200 rental or hosted inference. Fluence’s information compares pricing throughout suppliers and notes that specialist clouds might supply extra aggressive pricing and higher software program help than hyperscalers. Baseten’s experiments present the best way to run Llama 3 on GH200 for inference with 32 % higher throughput than H100. Decentralised GPU marketplaces equivalent to Golem or GPUX permit customers to lease GH200 capability from people or small information centres, though options like NVLink pairing could also be restricted.

Clarifai Compute Platform

Clarifai stands out by providing enterprise‑grade GH200 internet hosting with strong orchestration instruments. Key options embody:

  • Good autoscaling: robotically scales GH200 assets based mostly on mannequin demand, making certain low latency whereas optimising value.
  • GPU fractioning: splits a GH200 into smaller logical partitions, permitting a number of workloads to share the reminiscence pool and compute items effectively.
  • Cross‑cloud flexibility: run workloads on GH200 {hardware} throughout a number of clouds or on‑premises, simplifying migration and failover.
  • Unified management & governance: handle all deployments by Clarifai’s console or API, with monitoring, logging and compliance inbuilt.

These capabilities let enterprises undertake GH200 with out investing in bodily infrastructure and guarantee they solely pay for what they use.

Skilled Insights

  • NVLink change vs InfiniBand – NVLink change presents decrease latency and better bandwidth than InfiniBand, enabling a number of GH200 modules to behave like a single GPU.
  • Cloud availability is restricted – Attributable to excessive demand and restricted provide, GH200 situations could also be scarce on public cloud; working with specialist suppliers or Clarifai ensures precedence entry.
  • Compute orchestration simplifies adoption – Utilizing Clarifai’s orchestration options permits engineers to concentrate on fashions moderately than infrastructure, bettering time‑to‑market.

Choice Information: GH200 vs H100/H200 vs B200/Rubin

Fast Abstract: How do you determine which GPU to make use of? – The selection relies on reminiscence necessities, bandwidth, software program help, energy funds and value. GH200 presents unified reminiscence (96–144 GB HBM + 480 GB LPDDR) and excessive bandwidth (900 GB/s NVLink‑C2C), making it best for reminiscence‑sure duties. H100 and H200 are higher for compute‑sure workloads or when utilizing x86 software program stacks. B200 (Blackwell) and upcoming Rubin promise much more reminiscence and value effectivity, however availability might lag. Clarifai’s orchestration can combine and match {hardware} to satisfy workload wants.

Reminiscence Capability & Bandwidth

  • H100 – 80 GB HBM and 2 TB/s reminiscence bandwidth (HBM3). Reminiscence is native to the GPU; information have to be moved from CPU by way of PCIe.
  • H200 – 141 GB HBM3e and 4.8 TB/s bandwidth. A drop‑in alternative for H100 however nonetheless requires PCIe or NVLink bridging. Appropriate for compute‑sure duties needing extra GPU reminiscence.
  • GH200 – 96 GB HBM3 or 144 GB HBM3e plus 480 GB LPDDR5X accessible by way of 900 GB/s NVLink‑C2C, yielding a unified 624 GB pool (a rough sizing sketch follows this list).
  • B200 (Blackwell) – Rumoured to supply 208 GB HBM3e and 10 TB/s bandwidth; lacks unified CPU reminiscence, so nonetheless reliant on PCIe or NVLink connections.
  • Rubin platform – Will function an 88‑core CPU with 1.5 TB of LPDDR5X and 1.8 TB/s NVLink‑C2C bandwidth. NVL72 racks will drastically scale back inference value.
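
Taking the capacities listed above at face value, a rough weights-only fit check might look like the sketch below; it assumes fp16/bf16 weights and ignores KV caches, activations, and framework overhead, so treat it as a first pass rather than a sizing tool.

DEVICE_MEMORY_GB = {   # figures as quoted in this guide, not independently verified specs
    "H100": 80,
    "H200": 141,
    "GH200 (unified)": 624,
}

def weights_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    return params_billions * 1e9 * bytes_per_param / 1e9

model_gb = weights_gb(70)  # e.g. a 70B-parameter model in fp16
for device, capacity in DEVICE_MEMORY_GB.items():
    verdict = "fits" if model_gb < capacity else "needs sharding or offload"
    print(f"{device}: {model_gb:.0f} GB of weights -> {verdict}")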

Software program Stack & Structure

  • GH200 makes use of an ARM structure (Grace CPU). Many AI frameworks help ARM, however some Python libraries and CUDA variations might require recompilation. Clarifai’s native runner solves this by offering containerised environments with the fitting dependencies.
  • H100/H200 run on x86 servers and profit from mature software program ecosystems. In case your codebase closely relies on x86‑particular libraries, migrating to GH200 might require further effort.

Energy Consumption & Cooling

GH200 techniques can draw as much as 1 000 W per node because of the mixed CPU and GPU. Guarantee satisfactory cooling and energy infrastructure. H100 and H200 nodes sometimes eat much less energy individually however might require extra nodes to match GH200’s reminiscence capability.

Price & Availability

GH200 {hardware} is costlier than H100/H200 upfront, however the lowered variety of nodes required for reminiscence‑intensive workloads can offset value. Pricing information suggests GH200 leases value about $4–$6 per hour. H100/H200 could also be cheaper per hour however want extra items to host the identical mannequin. Blackwell and Rubin are usually not but extensively obtainable; early adopters might pay premium pricing.

Choice Matrix

  • Select GH200 when your workloads are reminiscence‑sure (LLM inference, RAG, GNNs, large embeddings) or require unified reminiscence for environment friendly pipelines.
  • Select H100/H200 for compute‑sure duties like convolutional neural networks, transformer pretraining, or when utilizing x86‑dependent software program. H200 provides extra HBM however nonetheless lacks unified CPU reminiscence.
  • Anticipate B200/Rubin for those who want even bigger reminiscence or higher value effectivity and might deal with delayed availability. Rubin’s NVL72 racks could also be revolutionary for exascale AI.
  • Leverage Clarifai to combine {hardware} varieties inside a single pipeline, utilizing GH200 for reminiscence‑heavy levels and H100/B200 for compute‑heavy phases.

Skilled Insights

  • Unified reminiscence modifications the calculus – Contemplate reminiscence capability first; the unified 624 GB on GH200 can exchange a number of H100 playing cards and simplify scaling.
  • ARM software program is maturing – Instruments like PyTorch and TensorFlow have improved help for ARM; containerised environments (e.g., Clarifai native runner) make deployment manageable.
  • HBM3e is a powerful bridge – H200’s HBM3e reminiscence gives a few of GH200’s capability advantages with out new CPU structure, providing an easier improve path.

Challenges, Limitations and Mitigation

Fast Abstract: What are the pitfalls of adopting GH200 and how will you mitigate them? – Key challenges embody software program compatibility on ARM, excessive energy consumption, cross‑die latency, provide chain constraints and greater value. Mitigation methods contain utilizing containerised environments (Clarifai native runner), proper‑sizing assets (GPU fractioning), and planning for provide constraints.

Software program Ecosystem on ARM

The Grace CPU makes use of an ARM structure, which can require recompiling libraries or dependencies. PyTorch, TensorFlow and CUDA help ARM, however some Python packages depend on x86 binaries. Lambda’s weblog warns that PyTorch have to be compiled for ARM, and there could also be restricted prebuilt wheels. Clarifai’s native runner addresses this by packaging dependencies and offering pre‑configured containers, making it simpler to deploy fashions on GH200.
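
Before deploying, a quick sanity-check sketch for an aarch64 node (assuming PyTorch is installed) confirms the interpreter architecture and that the installed wheel can actually see a CUDA device.

import platform
import torch

print(platform.machine())                  # expect "aarch64" on a Grace CPU
print(torch.__version__)                   # the wheel must be built for aarch64 + CUDA
print(torch.cuda.is_available())           # False often means a CPU-only or x86 wheel
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # e.g. a Hopper-class device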

Energy and Cooling Necessities

A GH200 superchip can eat as much as 900 W for the GPU and 1000 W for the total system. Information centres should guarantee satisfactory cooling, energy supply and monitoring. Utilizing good autoscaling to spin down unused nodes reduces power utilization. Contemplate the environmental affect and potential regulatory necessities (e.g., carbon reporting).

Latency & NUMA Results

Whereas NVLink‑C2C presents excessive bandwidth, cross‑die reminiscence entry has greater latency than native HBM. Chips and Cheese’s evaluation notes that the common latency will increase when accessing CPU reminiscence vs HBM. Builders ought to design algorithms to prioritise information locality: hold often accessed tensors in HBM and use CPU reminiscence for KV caches and occasionally accessed data. Analysis is ongoing to optimise information placement and scheduling; latest work, for instance, explores LLVM OpenMP offload optimisations on GH200, offering insights for HPC workloads.
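
A hedged PyTorch sketch of that data-placement idea: frequently used tensors stay in HBM while a larger cache is staged in pinned host memory and copied back asynchronously on a separate stream. The shapes and the single matmul are placeholders, not a tuned KV-cache implementation.

import torch

device = torch.device("cuda")
hot_weights = torch.randn(4096, 4096, device=device)  # frequently used: keep in HBM

# Larger, less frequently touched cache staged in pinned (page-locked) host memory.
kv_cache_cpu = torch.empty(32, 8, 32_768, 128, dtype=torch.float16, pin_memory=True)

copy_stream = torch.cuda.Stream()
with torch.cuda.stream(copy_stream):
    # Asynchronously bring one layer's cache back into HBM while other work proceeds.
    layer0_on_gpu = kv_cache_cpu[0].to(device, non_blocking=True)

out = hot_weights @ hot_weights                        # placeholder compute on the default stream
torch.cuda.current_stream().wait_stream(copy_stream)   # synchronize before using layer0_on_gpu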

Provide Chain & Pricing

Excessive demand and restricted provide imply GH200 situations will be scarce. Fluence’s pricing comparability highlights that GH200 might value greater than H100 per hour however presents higher efficiency for reminiscence‑heavy duties. To mitigate provide points, work with suppliers like Clarifai that reserve capability or use decentralised markets to offload non‑vital workloads.

Skilled Insights

  • Embrace hybrid structure – Use each H100/H200 and GH200 the place applicable; unify them by way of container orchestration to beat provide and software program limitations.
  • Optimise information placement – Maintain compute‑intensive kernels on HBM; offload caches to LPDDR reminiscence. Monitor reminiscence bandwidth and latency utilizing profiling instruments.
  • Plan for lengthy lead occasions – Pre‑order GH200 {hardware} or cloud reservations. Develop software program in moveable frameworks to ease transitions between architectures.

Rising Traits & Future Outlook

Fast Abstract: What’s subsequent for reminiscence‑centric AI {hardware}? – Traits embody HBM3e reminiscence, Blackwell (B200/GB200) GPUs, Rubin CPU platforms, NVLink‑6 and NVL72 racks, and the rise of exascale supercomputers. These improvements intention to additional scale back inference value and power consumption whereas growing reminiscence capability and compute density.

HBM3e and Blackwell

The HBM3e revision of GH200 already will increase reminiscence capability to 144 GB and bandwidth to 4.9 TB/s. Nvidia’s subsequent GPU structure, Blackwell, options the B200 and server configurations like GB200 and GB300. These chips will enhance HBM capability to round 208 GB, present improved compute throughput and should incorporate the Grace or Rubin CPU for unified reminiscence. In accordance with Medium analyst Adrian Cockcroft, GH200 pairs an H200 GPU with the Grace CPU and might join 256 modules utilizing shared reminiscence for improved efficiency.

Rubin Platform and NVLink‑6

Nvidia’s Rubin platform pushes reminiscence‑centric design additional by introducing an 88‑core CPU with 1.5 TB LPDDR5X and 1.8 TB/s NVLink‑C2C bandwidth. Rubin’s NVL72 rack techniques will scale back inference value by 10× and the variety of GPUs wanted for coaching by 4× in contrast with Blackwell. We will anticipate mainstream adoption round 2026–2027, though early entry could also be restricted to massive cloud suppliers.

Exascale Supercomputers & International AI Infrastructure

Supercomputers like JUPITER and Helios display the potential of GH200 at scale. JUPITER makes use of 24 000 GH200 superchips and is anticipated to ship greater than 90 exaflops. These techniques will energy analysis into local weather change, climate prediction, quantum physics and AI. As generative AI functions equivalent to video era and protein folding require extra reminiscence, these exascale infrastructures might be essential.

Business Collaboration and Ecosystem

Nvidia’s press releases emphasise that main tech firms (Google, Meta, Microsoft) and integrators like SoftBank are investing closely in GH200 techniques. In the meantime, storage and networking distributors are adapting their merchandise to deal with unified reminiscence and excessive‑throughput information streams. The ecosystem will proceed to develop, bringing higher software program instruments, reminiscence‑conscious schedulers and cross‑vendor interoperability.

Skilled Insights

  • Reminiscence is the brand new frontier – Future platforms will emphasise reminiscence capability and bandwidth over uncooked flops; algorithms might be redesigned to use unified reminiscence.
  • Rubin and NVLink 6 – These will probably allow multi‑rack clusters with unified reminiscence measured in petabytes, remodeling AI infrastructure.
  • Put together now – Constructing pipelines that may run on GH200 units you as much as undertake B200/Rubin with minimal modifications.

Clarifai Product Integration & Finest Practices

Fast Abstract: How does Clarifai leverage GH200 and what are greatest practices for customers? – Clarifai presents enterprise‑grade GH200 internet hosting with options equivalent to good autoscaling, GPU fractioning, cross‑cloud orchestration, and a native runner for ARM‑optimised deployment. To maximise efficiency, use bigger batch sizes, retailer key–worth caches on CPU reminiscence, and combine vector databases with Clarifai’s RAG APIs.

Clarifai’s GH200 Internet hosting

Clarifai’s compute platform makes the GH200 accessible with no need to buy {hardware}. It abstracts complexity by options:

  • Good autoscaling provisions GH200 situations as demand will increase and scales them down throughout idle intervals.
  • GPU fractioning lets a number of jobs share a single GH200, splitting reminiscence and compute assets to maximise utilisation.
  • Cross‑cloud orchestration permits workloads to run on GH200 throughout varied clouds and on‑premises infrastructure with unified monitoring and governance.
  • Unified management & governance gives centralised dashboards, auditing and function‑based mostly entry, vital for enterprise compliance.

Clarifai’s RAG and embedding APIs are optimised for GH200 and help vector search and summarisation. Builders can deploy LLMs with massive context home windows and combine exterior information sources with out worrying about reminiscence administration. Clarifai’s pricing is clear and sometimes tied to utilization, providing value‑efficient entry to GH200 assets.

Finest Practices for Deploying on GH200

  1. Use massive batch sizes – Leverage the unified reminiscence to extend batch sizes for inference; this reduces overhead and improves throughput.
  2. Offload KV caches to CPU reminiscence – Retailer key–worth caches in LPDDR reminiscence to unlock HBM for compute; NVLink‑C2C ensures low‑latency entry.
  3. Combine vector databases – For RAG, join Clarifai’s APIs to vector shops; hold indices in unified reminiscence to speed up search.
  4. Monitor reminiscence bandwidth – Use profiling instruments to detect reminiscence bottlenecks. Information placement issues; excessive‑frequency tensors ought to keep in HBM (a minimal monitoring sketch follows this list).
  5. Undertake containerised environments – Use Clarifai’s native runner to deal with ARM dependencies and keep reproducibility.
  6. Plan cross‑{hardware} pipelines – Mix GH200 for reminiscence‑intensive levels with H100/B200 for compute‑heavy levels, orchestrated by way of Clarifai’s platform.
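
For item 4, a minimal monitoring sketch using PyTorch's built-in memory counters gives a first look at HBM pressure before reaching for heavier profilers; the matmul is a stand-in workload.

import torch

torch.cuda.reset_peak_memory_stats()
x = torch.randn(8192, 8192, device="cuda")
y = x @ x   # stand-in workload

print(f"allocated now : {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
print(torch.cuda.memory_summary(abbreviated=True))   # per-pool breakdown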

Skilled Insights

  • Reminiscence‑conscious design – Rethink your algorithms to use unified reminiscence: pre‑allocate massive buffers, scale back information copies and tune for NVLink bandwidth.
  • GPU sharing boosts ROI – Fractioning a GH200 throughout a number of workloads will increase utilisation and lowers value per job; that is particularly helpful for startups.
  • Clarifai’s cross‑cloud synergy – Working workloads throughout a number of clouds prevents vendor lock‑in and ensures excessive availability.

Regularly Requested Questions

Q1: Is GH200 obtainable in the present day and the way a lot does it value? – Sure. GH200 techniques can be found by way of cloud suppliers and specialist GPU clouds. Rental costs vary from $4–$6 per hour relying on supplier and area. Clarifai presents utilization‑based mostly pricing by its platform.

Q2: How does GH200 differ from H100 and H200? – GH200 fuses a CPU and GPU on one module with 900 GB/s NVLink‑C2C, making a unified reminiscence pool of as much as 624 GB. H100 is a standalone GPU with 80 GB HBM, whereas H200 upgrades the H100 with 141 GB HBM3e. GH200 is healthier for reminiscence‑sure duties; H100/H200 stay robust for compute‑sure workloads and x86 compatibility.

Q3: Will I must rewrite my code to run on GH200? – Most AI frameworks (PyTorch, TensorFlow, JAX) help ARM and CUDA. Nevertheless, some libraries may have recompilation. Utilizing containerised environments (e.g., Clarifai native runner) simplifies the migration.

Q4: What about energy consumption and cooling? – GH200 nodes can eat round 1 000 W. Guarantee satisfactory energy and cooling. Good autoscaling reduces idle consumption.

Q5: When will Blackwell/B200/Rubin be extensively obtainable? – Nvidia has introduced B200 and Rubin platforms, however broad availability might arrive in late 2026 or 2027. Rubin guarantees 10× decrease inference value and 4× fewer GPUs in comparison with Blackwell. For many builders, GH200 will stay a flagship selection by 2026.

Conclusion

The Nvidia GH200 marks a turning level in AI {hardware}. By fusing a 72‑core Grace CPU with a Hopper/H200 GPU by way of NVLink‑C2C, it delivers a unified reminiscence pool as much as 624 GB and eliminates the bottlenecks of PCIe. Benchmarks present as much as 1.8× extra efficiency than the H100 and massive enhancements in value per token for LLM inference. These features stem from reminiscence: the flexibility to maintain whole fashions, key–worth caches and vector indices on chip. Whereas GH200 isn’t good—software program on ARM requires adaptation, energy consumption is excessive and provide is restricted—it presents unparalleled capabilities for reminiscence‑sure workloads.

As AI enters the period of trillion‑parameter fashions, reminiscence‑centric computing turns into important. GH200 paves the way in which for Blackwell, Rubin and past, with bigger reminiscence swimming pools and extra environment friendly NVLink interconnects. Whether or not you’re constructing chatbots, producing video, exploring scientific simulations or coaching recommender techniques, GH200 gives a strong platform. Partnering with Clarifai simplifies adoption: their compute platform presents good autoscaling, GPU fractioning and cross‑cloud orchestration, making the GH200 accessible to groups of all sizes. By understanding the structure, efficiency traits and greatest practices outlined right here, you’ll be able to harness the GH200’s potential and put together for the subsequent wave of AI innovation.



ShinyHunters declare to be behind SSO-account information theft assaults



The ShinyHunters extortion gang claims it’s behind a wave of ongoing voice phishing assaults focusing on single sign-on (SSO) accounts at Okta, Microsoft, and Google, enabling menace actors to breach company SaaS platforms and steal firm information for extortion.

In these assaults, menace actors impersonate IT assist and name staff, tricking them into coming into their credentials and multi-factor authentication (MFA) codes on phishing websites that impersonate firm login portals.

As soon as compromised, the attackers acquire entry to the sufferer’s SSO account, which might present entry to different related enterprise functions and companies.


SSO companies from Okta, Microsoft Entra, and Google allow corporations to hyperlink third-party functions right into a single authentication move, giving staff entry to cloud companies, inside instruments, and enterprise platforms with a single login. 

These SSO dashboards usually listing all related companies, making a compromised account a gateway into company programs and information.

Platforms generally related by SSO embrace Salesforce, Microsoft 365, Google Workspace, Dropbox, Adobe, SAP, Slack, Zendesk, Atlassian, and plenty of others.

Microsoft Entra single sign-on (SSO) dashboard
Supply: Microsoft

Vishing assaults used for information theft

As first reported by BleepingComputer, menace actors have been finishing up these assaults by calling staff and posing as IT employees, utilizing social engineering to persuade them to log into phishing pages and full MFA challenges in actual time.

After having access to a sufferer’s SSO account, the attackers browse the listing of related functions and start harvesting information from the platforms out there to that consumer.

BleepingComputer is conscious of a number of corporations focused in these assaults which have since obtained extortion calls for signed by ShinyHunters, indicating that the group was behind the intrusions.

BleepingComputer contacted Okta earlier this week in regards to the breaches, however the firm declined to touch upon the information theft assaults.

Nevertheless, Okta launched a report yesterday describing the phishing kits utilized in these voice-based assaults, which match what BleepingComputer has been advised.

In response to Okta, the phishing kits embrace a web-based management panel that permits attackers to dynamically change what a sufferer sees on a phishing website whereas talking to them on the telephone. This permits menace actors to information victims by every step of the login and MFA authentication course of.

If the attackers enter stolen credentials into the actual service and are prompted for MFA, they’ll show new dialog containers on the phishing website in actual time to instruct a sufferer to approve a push notification, enter a TOTP code, or carry out different authentication steps.

A phishing kit lets attackers display different dialogs while calling victims
Supply: Okta

ShinyHunters declare accountability

Whereas ShinyHunters declined to touch upon the assaults final evening, the group confirmed to BleepingComputer this morning that it’s accountable for a number of the social engineering assaults.

“We affirm we’re behind the assaults,” ShinyHunters advised BleepingComputer. “We’re unable to share additional particulars at the moment, moreover the truth that Salesforce stays our main curiosity and goal, the remainder are benefactors.”

The group additionally confirmed different elements of BleepingComputer’s reporting, together with particulars in regards to the phishing infrastructure and domains used within the marketing campaign. Nevertheless, it disputed {that a} screenshot of a phishing package command-and-control server shared by Okta was for its platform, claiming as an alternative that theirs was constructed in-house.

ShinyHunters claimed it’s focusing on not solely Okta but additionally Microsoft Entra and Google SSO platforms.

Microsoft stated it has nothing to share at the moment, and Google stated it had no proof its merchandise have been being abused within the marketing campaign.

“Right now, now we have no indication that Google itself or its merchandise are affected by this marketing campaign,” a Google spokesperson advised BleepingComputer.

ShinyHunters claims to be utilizing information stolen in earlier breaches, such because the widespread Salesforce information theft assaults, to establish and phone staff. This information contains telephone numbers, job titles, names, and different particulars used to make the social-engineering calls extra convincing.

Final evening, the group relaunched its Tor information leak website, which at the moment lists breaches at SoundCloud, Betterment, and Crunchbase.

SoundCloud beforehand disclosed an information breach in December 2025, whereas Betterment confirmed this month that its electronic mail platform had been abused to ship cryptocurrency scams and that information was stolen.

Crunchbase, which had not beforehand disclosed a breach, confirmed right this moment that information was stolen from its company community.

“Crunchbase detected a cybersecurity incident the place a menace actor exfiltrated sure paperwork from our company community,” an organization spokesperson advised BleepingComputer. “No enterprise operations have been disrupted by this incident. We now have contained the incident and our programs are safe.”

“Upon detecting the incident we engaged cybersecurity consultants and contacted federal legislation enforcement. We’re reviewing the impacted data to find out if any notifications are required in line with relevant authorized necessities.”
